Data Exploration and Wrangling
Our goal in this section is to adjust or “wrangle” the data from each year into a common format so that we can combine the data sets across years for our analysis, and so that we have values in our variables that are correct and easy to interpret. We will need to understand what is the same and what is different across the data from different years, rename and recode the variables (e.g., by replacing the numbers 1 and 2 with the values “Male” and “Female” for the Sex variable), and combine the data. We will walk through these steps below.
First, let’s take a look at our data. We can get a good sense of it using the glimpse() function of the dplyr package.
Rows: 17,711
Columns: 29
$ psu <chr> "015438", "015438", "015438", "015438", "015438", "015438"…
$ finwgt <dbl> 216.7268, 324.9620, 324.9620, 397.1552, 264.8745, 264.8745…
$ stratum <chr> "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "B…
$ Qn1 <dbl> 10, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
$ Qn2 <dbl> 2, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2…
$ Qn3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 5, 5…
$ ECIGT <dbl> 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1…
$ ECIGAR <dbl> 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2…
$ ESLT <dbl> 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ EELCIGT <dbl> 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1…
$ EROLLCIGTS <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2…
$ EFLAVCIGTS <dbl> 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ EBIDIS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ EFLAVCIGAR <dbl> 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 2…
$ EHOOKAH <dbl> 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ EPIPE <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ ESNUS <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ EDISSOLV <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CCIGT <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CCIGAR <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CSLT <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CELCIGT <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CROLLCIGTS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CFLAVCIGTS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CBIDIS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CHOOKAH <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CPIPE <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CSNUS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ CDISSOLV <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
Updating the set of variables and their names
The easiest way of making it so that the data from the different years can be combined is by making sure the different data sets all contain the same variables that share the same names. In addition, giving the columns informative names will help make our code more readable. Currently, it isn’t very clear what most of the variables indicate since the variable names are uninformative on their own, without the codebook.
We want to rename variables like Qn1 to something more meaningful like Age.
To do this we will use the rename() function of the dplyr package. The new name is always listed first before the =. This function will replace the old variable names with the new ones, i.e., after running the code below, there will no longer be a Qn1 variable in the data set, but there will be an Age variable instead. We will start working with the 2015 data, and then move on to the other years down below.
AVOCADO: I later learned that it is not correct that we need the same variables in the data set when using bind_rows, so why are we creating these dummy variables with NAs here? Won’t they be automatically created later in any case?
We also need to add new variables about usage of brands and specific flavors of e-cigarettes, because although later years have questions about this, this year unfortunately only has broad questions about flavors. Eventually we want to put the data from each year together, so we need the same variables for each year. Thus we need to add variables for brand and flavor to the 2015 data set that just contain missing values (NA).
To create these new variables, we will use the mutate function of the dplyr package. The mutate function can also be used to change the existing variables and create new variables from the old ones.
Now we can see how the data has changed:
Rows: 17,711
Columns: 38
$ psu <chr> "015438", "015438", "015438", "015438", "015438"…
$ finwgt <dbl> 216.7268, 324.9620, 324.9620, 397.1552, 264.8745…
$ stratum <chr> "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "BR3",…
$ Age <dbl> 10, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
$ Sex <dbl> 2, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 1, …
$ Grade <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
$ ECIGT <dbl> 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, …
$ ECIGAR <dbl> 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ ESLT <dbl> 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, …
$ EELCIGT <dbl> 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, …
$ EROLLCIGTS <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, …
$ EFLAVCIGTS <dbl> 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ EBIDIS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ EFLAVCIGAR <dbl> 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, …
$ EHOOKAH <dbl> 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, …
$ EPIPE <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ ESNUS <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ EDISSOLV <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CCIGT <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CCIGAR <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CSLT <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CELCIGT <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CROLLCIGTS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CFLAVCIGTS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CBIDIS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CHOOKAH <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CPIPE <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CSNUS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ CDISSOLV <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ brand_ecig <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ menthol <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ clove_spice <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ fruit <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ chocolate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ alcoholic_drink <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ candy_dessert_sweets <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ other <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ no_use <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
The data for 2016-2018 have many common attributes, so we will want to write code that can be applied to all three data sets. To do this, we will use a function in R, which is basically a piece of code that can be applied to similar but different objects in R (e.g., the data tibbles from each of these three years). For more information on functions, see for example here.
AVOCADO: I added a link to the r4ds chapter on functions; are there other places in the case studies that we discuss functions? Or additional links or resources you want to add?
These next 3 years have the same structure for many of the questions we are interested in. For example, they all have flavor questions, but not a brand question. Moreover, their variable names are consistent across the years; for each year, we want to replace the vague question name Q50A with the value menthol in all three data sets, and the same is true for the other flavor variables. For each of these years, we also want to create a variable brand_ecig that just contains NA values.
Since we want to perform the same modifications on the data from all three years, rather than repeating the same somewhat messy piece of code three times, we can do this more efficiently if we create a function to do all of these steps at once. Then we can use the map_at() function of the purrr package to apply the function we created (for renaming variables etc.) to the data from 2016-2018. By using vars() inside of the map_at() function we can specify what tibbles within our nyts_data list we want to include or exclude.
Update_survey <- function(x) { x %>%
rename(Age=Q1,
Sex=Q2,
Grade=Q3,
menthol=Q50A,
clove_spice=Q50B,
fruit=Q50C,
chocolate=Q50D,
alcoholic_drink=Q50E,
candy_dessert_sweets=Q50F,
other=Q50G,
no_use=Q50H) %>%
mutate(brand_ecig=NA)
}
#few options to apply to the data:
#nyts_data <-nyts_data %>% map_at(vars(nyts2016, nyts2017, nyts2018), Update_survey)
#nyts_data <-nyts_data %>% map_at(c("nyts2016", "nyts2017", "nyts2018"), Update_survey)
nyts_data <-nyts_data %>% map_at(vars(-nyts2015, -nyts2019), Update_survey)
The final year, 2019, has a slightly different data structure compared to these earlier data sets. For example, it actually has a brand_ecig variable already, but does not contain a no_use variable. So we can’t use the same function for this data set, but have to write an individual code chunk.
nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
rename(brand_ecig=Q40,
Age=Q1,
Sex=Q2,
Grade=Q3,
menthol=Q62A,
clove_spice=Q62B,
fruit=Q62C,
chocolate=Q62D,
alcoholic_drink=Q62E,
candy_dessert_sweets=Q62F,
other=Q62G) %>%
mutate(no_use="missing")
Now let’s take a look at the variable names for each of the years using the map function from purrr.
It’s looking better! There are still some differences in the set of variables in the different years, but things are much closer.
Updating Values
Now that we have made some progress on the selection and names of the variables themselves, we will work on the values contained in the different variables.
We can start with updating the values for Age and Grade, so that they are more understandable.
Recall from the codebook for this year’s data set that Age isn’t listed in the way one might expect, i.e., it is not just a number of years, but a numerically valued categorical variable.

The same is true for Grade:

This is why it is so important to always check the codebook!!
We also want to replace the value of 19 for Age to be ">18" and the value of 13 for Grade to be replaced with "Ungraded/Other" Also according to the codebooks, numeric values of 1 indicate a survey answer of FALSE, while a value of 2 indicates TRUE. Sex needs to be recoded, but it is a bit unclear if it is encoded slightly differently across years.
AVOCADO: For the Sex variable, how did you determine “It’s a bit difficult to tell if the encoding was the same across years” – I don’t see documentation of this (i.e., is this missing from the codebook somehow?), so it might be helpful to let the reader know how you figured this out. Carrie says to look at the codebooks and it will make sense; could consider screen shots. I looked at it again; the confusion seems to come from the SEX:RECODE variables, but we aren’t using those. Otherwise, I do think this could be dropped.
Let’s create a function to make all these updates except for Sex. We will use the mutate function again, as well as mutate_all and recode to replace specific values of certain variables.
Update_values <- function(x) { x %>%
mutate(Age=as.numeric(Age)+8,
Grade=as.numeric(Grade)+5)%>%
mutate(Age=as.factor(Age),
Grade=as.factor(Grade),
) %>%
mutate_all(~ replace(., . %in% c("*", "**"), NA))%>%
mutate(Age=recode(Age,
`19` = ">18",
),
Grade=recode(Grade,
`13` = "Ungraded/Other")) %>%
mutate_at(vars(starts_with("E", ignore.case = FALSE),
starts_with("C", ignore.case = FALSE)),
list(~recode(.,
`1` = TRUE,
`2` = FALSE,
.default = NA,
.missing = NA)))
}
nyts_data <-nyts_data %>% map(.,Update_values)
Now let’s update the Sex encoding:
It’s a bit difficult to tell if the encoding was the same across years, so let’s check using the count() function of the dplyr package.
According to the codebook, we should have:
1) 8,958 males in 2015 2) 10,438 males in 2016 3) 8,881 males in 2017
4) 10,069 males in 2018
5) 9,803 males in 2019
$nyts2015
# A tibble: 3 x 2
Sex n
<dbl> <int>
1 1 8958
2 2 8622
3 NA 131
$nyts2016
# A tibble: 3 x 2
Sex n
<dbl> <int>
1 1 10438
2 2 10082
3 NA 155
$nyts2017
# A tibble: 3 x 2
Sex n
<dbl> <int>
1 1 8881
2 2 8815
3 NA 176
$nyts2018
# A tibble: 3 x 2
Sex n
<dbl> <int>
1 1 10069
2 2 9920
3 NA 200
$nyts2019
# A tibble: 3 x 2
Sex n
<chr> <int>
1 .N 116
2 1 9803
3 2 9099
Thus, it looks like males were encoded as 1 for each year.
- 8,958 males in 2015 where 1 = male
- 10,438 males in 2016 where 1 = male
- 8,881 males in 2017 where 1 = male
- 10,069 males in 2018 where 1 = male
- 9,803 males in 2019 where 1 = male
$nyts2015
# A tibble: 3 x 2
Sex n
<fct> <int>
1 male 8958
2 female 8622
3 <NA> 131
$nyts2016
# A tibble: 3 x 2
Sex n
<fct> <int>
1 male 10438
2 female 10082
3 <NA> 155
$nyts2017
# A tibble: 3 x 2
Sex n
<fct> <int>
1 male 8881
2 female 8815
3 <NA> 176
$nyts2018
# A tibble: 3 x 2
Sex n
<fct> <int>
1 male 10069
2 female 9920
3 <NA> 200
$nyts2019
# A tibble: 3 x 2
Sex n
<fct> <int>
1 .N 116
2 male 9803
3 female 9099
Looks good!
The years (2016-2019) that have flavors also need the flavor data to be logical (meaning TRUE or FALSE):
Now there are just a few changes needed that are specific to 2019. Specifically, some of the 2019 questions use the values “.N”, “.S”, and “.Z” to indicate different types of missing data (see for example Q2 of the 2019 codebook); we just want them to be replaced with NA values.
nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
mutate_all(~ replace(., . %in% c(".N",".S",".Z"), NA)) %>%
mutate(psu=as.character(psu)) %>%
mutate(brand_ecig = recode(brand_ecig,
`1` = "Other", #levels 1,8 combined to `Other`
`2` = "Blu",
`3` = "JUUL",
`4` = "Logic",
`5` = "MarkTen",
`6` = "NJOY",
`7` = "Vuse",
`8` = "Other"))
Great! Now our values don’t need to be handled any differently for any of the years, thus we can combine the data across years.
Even though we have different numbers of variables for each year, we can coerce the data to be combined into one tibble by using the bind_rows() function of dplyr. Importantly, this function does not require that the columns be the same. This will create NA values for any variable that is not present in given data frame but is present in one of the other data frames that is being combined. Note that the bind_cols() function does expect that the rows match. The .id argument will create a new variable with values to link each row to its original data frame. For more information see here.
Rows: 95,465
Columns: 41
$ year <chr> "nyts2015", "nyts2015", "nyts2015", "nyts2015", …
$ psu <chr> "015438", "015438", "015438", "015438", "015438"…
$ finwgt <dbl> 216.7268, 324.9620, 324.9620, 397.1552, 264.8745…
$ stratum <chr> "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "BR3",…
$ Age <fct> 18, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, …
$ Sex <fct> female, male, male, male, female, female, male, …
$ Grade <fct> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, …
$ ECIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ ECIGAR <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FA…
$ ESLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EELCIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ EROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EFLAVCIGTS <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, F…
$ EBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ EFLAVCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, F…
$ EHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ EPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ ESNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CCIGT <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CSLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CELCIGT <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, …
$ CROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CFLAVCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CSNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ brand_ecig <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ menthol <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ clove_spice <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ fruit <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ chocolate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ alcoholic_drink <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ candy_dessert_sweets <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ other <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ no_use <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ EHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ CHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
We will want to do some of our analysis split by year, so we would like to be sure we have one variable that has the correct value for year. It looks like we just need to remove "nyts" from the year variable that we created from the names of the tibbles in our list and we should be all set. We will use another function from the stringr package to do this. The str_remove() function takes a string followed by a pattern and removes the pattern from the string.
Here is our clean and wrangled data:
Rows: 95,465
Columns: 41
$ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, …
$ psu <chr> "015438", "015438", "015438", "015438", "015438"…
$ finwgt <dbl> 216.7268, 324.9620, 324.9620, 397.1552, 264.8745…
$ stratum <chr> "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "BR3",…
$ Age <fct> 18, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, …
$ Sex <fct> female, male, male, male, female, female, male, …
$ Grade <fct> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, …
$ ECIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ ECIGAR <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FA…
$ ESLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EELCIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ EROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EFLAVCIGTS <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, F…
$ EBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ EFLAVCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, F…
$ EHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ EPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ ESNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CCIGT <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CSLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CELCIGT <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, …
$ CROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CFLAVCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CSNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ brand_ecig <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ menthol <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ clove_spice <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ fruit <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ chocolate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ alcoholic_drink <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ candy_dessert_sweets <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ other <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ no_use <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ EHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ CHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
Note that there are several variables where there are similar names, but with a C compared to an E in the variable name. Those starting with C are related to questions about current usage (last 30 days), while those with an E are related to usage across the subjects whole life (“ever” usage). We will discuss these groups further below.
AVOCADO: I’m not sure where this should go, but I had the thought so I figured I’d include it. Maybe we can be more consistent about how we refer to students, subjects, respondents? Carrie suggests maybe trying to use students where we are talking about this specific data set, but respondents in the general survey weighting section.
Variable Table
Click here to see a table about the final variables in our data set.
value 1 = yes, value 2 = no
| year |
the year that the survey results from a particular student were acquired |
| psu |
the primary sampling unit for the survey weighting |
| finwgt |
the final analysis weight for the survey weighting |
| stratum |
the stratum used for variance estimation for the survey weighting |
| Age |
the age of the student when they took the survey |
| Sex |
the sex of the student when they took the survey |
| Grade |
the grade of the the student when the took the survey |
| ECIGT |
student reported having ever tried cigarette smoking, even one or two puffs |
| ECIGAR |
student reported having ever tried cigar, cigarillo, or little cigar smoking, even one or two puffs |
| ESLT |
student reported having ever tried chewing tobacco, snuff, or dip |
| EELCIGT |
student reported having ever tried e-cigarettes |
| EROLLCIGTS |
student reported having ever tried roll-your-own cigarettes |
| EFLAVCIGTS |
(2015 only) based on answer to “Which of the following tobacco products that you used in the past 30 days were flavored?” |
| EBIDIS |
student reported having ever tried bidis (small brown cigarettes wrapped in a leaf) |
| EFLAVCIGAR |
student reported having ever tried a flavored cigar (2015-2016) |
| EHOOKAH |
student reported having ever smoked tobacco from a hookah or a waterpipe |
| EPIPE |
student reported having ever smoked tobacco from a pipe (not hookah) |
| ESNUS |
student reported having ever used snus, such as Camel or Malboro Snus |
| EDISSOLV |
student reported having ever tried dissolvable tobacco products such as Ariva, Stonewall, Camel orbs, Camel sticks, Marlboro sticks, or Camel strips |
| CCIGT |
student reported they smoked cigarettes on >= 1 of the past 30 days |
| CCIGAR |
student reported they smoked cigars on >= 1 of the past 30 days |
| CSLT |
student reported they used chewing tobacco, snuff, or dip on >= 1 of the past 30 days |
| CELCIGT |
student reported they used electronic cigarettes or e-cigarettes one or more days in the past 30 |
| CROLLCIGTS |
student reported they smoked roll-your-own cigarettes during the past 30 days |
| CFLAVCIGTS |
(2015 only) based on answer to “Which of the following tobacco products that you used in the past 30 days were flavored?” |
| CBIDIS |
student reported they smoked bidis during the past 30 days |
| CHOOKAH |
student reported they smoked tobacco in a hookah on >= 1 of the past 30 days |
| CPIPE |
student reported they smoked tobacco in a pipe (not hookah) during the past 30 days |
| CSNUS |
student reported they used snus during the past 30 days |
| CDISSOLV |
student reported they used dissolvable tobacco products such as Ariva, Stonewall, Camel orbs, Camel sticks, Marlboro sticks, or Camel strips during the past 30 days |
| brand_ecig |
student answer to “During the past 30 days, what brand of e-cigarettes did you usually use?” |
| menthol |
student selected Menthol or mint as the answer to “What flavors of tobacco products have you used in the past 30 days? (select one or more)” |
| clove_spice |
student selected clove or spice as the answer to “What flavors of tobacco products have you used in the past 30 days? (select one or more)” |
| fruit |
student selected fruit as the answer to “What flavors of tobacco products have you used in the past 30 days? (select one or more)” |
| chocolate |
student selected chocolate as the answer to “What flavors of tobacco products have you used in the past 30 days? (select one or more)” |
| alcoholic_drink |
student selected alcoholic drink (such as wine, cognac, margarita, or other cocktails) as the answer to “What flavors of tobacco products have you used in the past 30 days? (choose one or more)” |
| candy_dessert_sweets |
student selected candy, desserts or other sweets as the answer to “What flavors of tobacco products have you used in the past 30 days? (choose one or more)” |
| other |
student selected some other flavor not listed as the answer to “What flavors of tobacco products have you used in the past 30 days? (choose one or more)” |
| no_use |
student reported that they did not use any flavored tobacco products in the past 30 days or did not answer the question about using flavors |
| EHTP |
student reported having ever tried heated (also known as “heat-not-burn”) tobacco products |
| CHTP |
student reported they used heated tobacco products during the past 30 days |
Data Visualization
Recall that our main questions were:
- How has tobacco and e-cigarette use by American youths changed since 2015?
- How do vaping rates compare between males and females?
What vaping brands and flavors appear to be used the most frequently?
We will base this on the following survey questions:
> “During the past 30 days, what brand of e-cigarettes did you usually use?”
>" What flavors of tobacco products have you used in the past 30 days?"
Is there a relationship between e-cigarette/vaping use and other tobacco use?
We are now going to create data visualizations to explore each of these questions.
For many of these questions we will be interested in both current and ever users, so we will want to create a variable for labeling individuals who are current users of any tobacco product (or not, i.e., who do not currently use a tobacco product) and a variable for labeling individuals who are “ever users” of any tobacco product (or not, i.e., who have never used a tobacco product).
We define these two groups as follows:
- current = students who used a product for >=1 day in the past 30 days
- ever = students who report having used or tried a product at any point in time
All current users are therefore ever users but not all ever users are current users. Thus, current users are a subset of ever users.
To add these grouping variables to our data we will do a bit more wrangling using the mutate() function again of the dplyr package. As discussed above, our data set contains a set of questions that relate to whether the student has ever used the particular tobacco product (questions that start with the letter “E”), and questions that relate to whether the student currently uses the particular tobacco product (questions that start with the letter “C”).
Here are some examples for these data entries:
- EPIPE: Students who reported they have smoked tobacco from a pipe (not hookah).
- CPIPE: Students who reported they smoked tobacco in a pipe (not hookah) during the past 30 days.
- EROLLCIGTS: RECODE: Students who reported they have tried smoking roll-your-own cigarettes.
- CROLLCIGTS: RECODE: Students who reported they smoked roll-your-own cigarettes during the past 30 days.
Based on many questions like this:
In the past 30 days, which of the following products have you used on at least one day? (Select one or more) A. Roll-your-own cigarettes
B. Pipes filled with tobacco (not hookah or waterpipe)
C. Snus, such as Camel, Marlboro, or General Snus
D. Dissolvable tobacco products such as Ariva, Stonewall, Camel orbs, Camel sticks, Marlboro sticks, or Camel strips
E. Bidis (small brown cigarettes wrapped in a leaf)
F. I have not used any of the products listed above in the past 30 days
Which of the following tobacco products have you ever tried, even just one time? (Select one or more)
A. Roll-your-own cigarettes
B. Pipes filled with tobacco (not hookah or waterpipe)
C. Snus, such as Camel, Marlboro, or General Snus
D. Dissolvable tobacco products such as Ariva, Stonewall, Camel orbs, Camel sticks, Marlboro sticks, or Camel strips
E. Bidis (small brown cigarettes wrapped in a leaf)
F. I have never tried any of the products listed above
We will sum across the variables that relate to ever or current tobacco usage questions to determine if the student answered yes to any of the ever or current questions.
We will then use the case_when() function of the dplyr package to convert the sum values to TRUE or FALSE based on the threshold of zero. If the sum is greater than zero, then we know the student answered yes to at least one question.
nyts_data <-nyts_data %>%
mutate(tobacco_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
base::apply(1, sum, na.rm=TRUE),
tobacco_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
apply(1, sum, na.rm=TRUE)) %>%
mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
tobacco_sum_ever == 0 ~ FALSE),
tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
tobacco_sum_current == 0 ~ FALSE))
Rows: 95,465
Columns: 45
$ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, …
$ psu <chr> "015438", "015438", "015438", "015438", "015438"…
$ finwgt <dbl> 216.7268, 324.9620, 324.9620, 397.1552, 264.8745…
$ stratum <chr> "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "BR3",…
$ Age <fct> 18, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, …
$ Sex <fct> female, male, male, male, female, female, male, …
$ Grade <fct> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, …
$ ECIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ ECIGAR <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FA…
$ ESLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EELCIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ EROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EFLAVCIGTS <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, F…
$ EBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ EFLAVCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, F…
$ EHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ EPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ ESNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CCIGT <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CSLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CELCIGT <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, …
$ CROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CFLAVCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CSNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ brand_ecig <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ menthol <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ clove_spice <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ fruit <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ chocolate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ alcoholic_drink <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ candy_dessert_sweets <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ other <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ no_use <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ EHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ CHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ tobacco_sum_ever <int> 1, 4, 0, 3, 0, 2, 8, 4, 0, 0, 0, 1, 1, 0, 0, 4, …
$ tobacco_sum_current <int> 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ tobacco_ever <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE…
$ tobacco_current <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, F…
We are also interested in e-cigarette usage compared to other tobacco products, so we will create some variables related to the sum of all e-cigarette usage question variables and the sum of all tobacco usage question variables excluding those that are about e-cigarettes. There is only one variable about e-cigarette usage ever (EELCIGT) and one about current usage (CELCIGT).
nyts_data <-nyts_data %>% mutate(ecig_sum_ever = select(., EELCIGT) %>%
apply(1, sum, na.rm=TRUE),
ecig_sum_current = select(., CELCIGT) %>%
apply(1, sum, na.rm=TRUE),
non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
select(.,-EELCIGT) %>%
apply(1, sum, na.rm=TRUE),
non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
select(.,-CELCIGT) %>%
apply(1, sum, na.rm=TRUE)) %>%
mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
ecig_sum_ever == 0 ~ FALSE),
ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
ecig_sum_current == 0 ~ FALSE),
non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
non_ecig_sum_ever == 0 ~ FALSE),
non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
non_ecig_sum_current == 0 ~ FALSE))
Rows: 95,465
Columns: 53
$ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, …
$ psu <chr> "015438", "015438", "015438", "015438", "015438"…
$ finwgt <dbl> 216.7268, 324.9620, 324.9620, 397.1552, 264.8745…
$ stratum <chr> "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "BR3",…
$ Age <fct> 18, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, …
$ Sex <fct> female, male, male, male, female, female, male, …
$ Grade <fct> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, …
$ ECIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ ECIGAR <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FA…
$ ESLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EELCIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ EROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EFLAVCIGTS <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, F…
$ EBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ EFLAVCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, F…
$ EHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ EPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ ESNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CCIGT <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ CSLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CELCIGT <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, …
$ CROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CFLAVCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CSNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ brand_ecig <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ menthol <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ clove_spice <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ fruit <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ chocolate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ alcoholic_drink <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ candy_dessert_sweets <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ other <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ no_use <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ EHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ CHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ tobacco_sum_ever <int> 1, 4, 0, 3, 0, 2, 8, 4, 0, 0, 0, 1, 1, 0, 0, 4, …
$ tobacco_sum_current <int> 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ tobacco_ever <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE…
$ tobacco_current <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, F…
$ ecig_sum_ever <int> 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, …
$ ecig_sum_current <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ non_ecig_sum_ever <int> 1, 3, 0, 2, 0, 1, 7, 3, 0, 0, 0, 0, 1, 0, 0, 3, …
$ non_ecig_sum_current <int> 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ecig_ever <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ ecig_current <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, …
$ non_ecig_ever <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE…
$ non_ecig_current <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, …
Finally, we are also interested in grouping students that only use e-cigarettes and those that only use other forms of tobacco.
Recall that current users are a subset of ever users, thus students would typically answer yes to having tried vaping products if they had used them one or more days in the past 30 days.
First we will make a small toy dataset called test to show what we will do with the larger dataset:
test <- tibble(ecig_ever = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE"),
non_ecig_ever = c("TRUE", "FALSE","FALSE","FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE"),
ecig_current = c("TRUE", "FALSE","FALSE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE"),
non_ecig_current = c("TRUE", "FALSE","TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "TRUE"))
test
# A tibble: 9 x 4
ecig_ever non_ecig_ever ecig_current non_ecig_current
<chr> <chr> <chr> <chr>
1 TRUE TRUE TRUE TRUE
2 TRUE FALSE FALSE FALSE
3 TRUE FALSE FALSE TRUE
4 TRUE FALSE TRUE FALSE
5 FALSE FALSE TRUE TRUE
6 FALSE FALSE FALSE FALSE
7 TRUE TRUE FALSE FALSE
8 FALSE TRUE FALSE FALSE
9 FALSE TRUE FALSE TRUE
Now, let’s look at identifying students who have tried e-cigarettes, but are not current users, and who have never tried other tobacco products (and are therefore not current users). We will again use the case_when() and the mutate function to create new variables with specific values when certain conditions are met. In this case, we will specify that several conditions must be met by using the & symbol. For a value of TRUE for the new ecig_only_ever variable, all of the conditions combined with & must be met. If any of the conditions are not met then the ecig_only_ever value will be FALSE based on the last line TRUE ~ FALSE.
# A tibble: 9 x 5
ecig_ever non_ecig_ever ecig_current non_ecig_current ecig_only_ever
<chr> <chr> <chr> <chr> <lgl>
1 TRUE TRUE TRUE TRUE FALSE
2 TRUE FALSE FALSE FALSE TRUE
3 TRUE FALSE FALSE TRUE FALSE
4 TRUE FALSE TRUE FALSE FALSE
5 FALSE FALSE TRUE TRUE FALSE
6 FALSE FALSE FALSE FALSE FALSE
7 TRUE TRUE FALSE FALSE FALSE
8 FALSE TRUE FALSE FALSE FALSE
9 FALSE TRUE FALSE TRUE FALSE
We can see from the second row, that the ecig_only_ever is TRUE when we would expect it to be. We can also see from the fourth row, that even though the student reported yes to ever trying e-cigarettes, because they also reported yes to currently using e-cigarettes the value for only ever trying e-cigarettes is FALSE. Additionally we can see from the seventh row that similarly even though the student reported yest to ever trying e-cigarettes, they also reported yes to ever trying other products, and the value for only ever trying e-cigarettes is FALSE. Importantly, we can see from the 6th row, that if all responses are negative than the value is FALSE.
Now we will expand this to the other possible categories. In this case we note that since current users are a subset of ever users, it doesn’t matter if a user reports yes to ever trying e-cigarettes, they can still be a current user.
Rows: 9
Columns: 9
$ ecig_ever <chr> "TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE…
$ non_ecig_ever <chr> "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FA…
$ ecig_current <chr> "TRUE", "FALSE", "FALSE", "TRUE", "TRUE", "FALS…
$ non_ecig_current <chr> "TRUE", "FALSE", "TRUE", "FALSE", "TRUE", "FALS…
$ ecig_only_ever <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ ecig_only_current <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,…
$ non_ecig_only_ever <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ non_ecig_only_current <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ no_use <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
Take a minute to check that the values are what we would expect.
OK, now we are going to make a Group variable based on the new variables we just made to classify students into one of four mutually exclusive and exhaustive categories. In this case we will have a particular value based on one condition or another. This or conditional is specified by the | symbol. Only one of the conditions needs to exist for that particular value, whereas when we used the & symbol, all of the conditions had to be met. If a student has ever tried or currently uses e-cigarettes, but has never tried other tobacco products, the value will be Only e-cigarettes. If a student has ever tried or is a current user of other tobacco products, but has never tried e-cigarettes, the value will be Only other products. If the value of the no_use variable is simply TRUE, then the Group variable value will be Neither. Finally, if a student has tried or currently uses both e-cigarettes and other tobacco products the Group variable value will be Combination of products. Therefore the values for the usage of the variables based on only using e-cigarettes or only other products will all be FALSE. Of course the student will also have had to answer yes to any of the use-based questions, thus the no_use variable would also need to be FALSE.
AVOCADO: Consider dropping the last two sentences in the paragraph above? i.e., from “Therefore the values…” to “would also need to be FALSE.” I’m not sure what they are adding. LEAVE IN FOR CARRIE TO REVIEW.
# A tibble: 4 x 2
Group n
<chr> <int>
1 Combination of products 4
2 Neither 1
3 Only e-cigarettes 2
4 Only other products 2
Rows: 9
Columns: 10
$ ecig_ever <chr> "TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE…
$ non_ecig_ever <chr> "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FA…
$ ecig_current <chr> "TRUE", "FALSE", "FALSE", "TRUE", "TRUE", "FALS…
$ non_ecig_current <chr> "TRUE", "FALSE", "TRUE", "FALSE", "TRUE", "FALS…
$ ecig_only_ever <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ ecig_only_current <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,…
$ non_ecig_only_ever <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ non_ecig_only_current <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ no_use <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ Group <chr> "Combination of products", "Only e-cigarettes",…
OK, now that we have seen how this works with our toy dataset, we will apply our code to our nyts_data.
nyts_data %<>%
mutate(ecig_only_ever = case_when(ecig_ever == TRUE &
non_ecig_ever == FALSE &
ecig_current == FALSE &
non_ecig_current == FALSE ~ TRUE,
TRUE ~ FALSE),
ecig_only_current = case_when(ecig_current == TRUE &
non_ecig_ever == FALSE &
non_ecig_current == FALSE ~ TRUE,
TRUE ~ FALSE),
non_ecig_only_ever = case_when(non_ecig_ever == TRUE &
ecig_ever == FALSE &
ecig_current == FALSE &
non_ecig_current == FALSE ~ TRUE,
TRUE ~ FALSE),
non_ecig_only_current = case_when(non_ecig_current == TRUE &
ecig_ever == FALSE &
ecig_current == FALSE ~ TRUE,
TRUE ~ FALSE),
no_use = case_when(non_ecig_ever == FALSE &
ecig_ever == FALSE &
ecig_current == FALSE &
non_ecig_current == FALSE ~TRUE,
TRUE ~ FALSE)) %>%
mutate(Group = case_when(ecig_only_ever == TRUE |
ecig_only_current == TRUE ~ "Only e-cigarettes",
non_ecig_only_ever == TRUE |
non_ecig_only_current == TRUE ~ "Only other products",
no_use == TRUE ~ "Neither",
ecig_only_ever == FALSE &
ecig_only_current == FALSE &
non_ecig_only_ever == FALSE &
non_ecig_only_current == FALSE &
no_use == FALSE ~ "Combination of products"))
Lastly, it can be very helpful to have the total number of students surveyed each year. We can easily add a variable for this by using the add_count() function of the dplyr package. This will create a variable called n which will show the total number of survey responses for that year.
Rows: 95,465
Columns: 59
$ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,…
$ psu <chr> "015438", "015438", "015438", "015438", "015438…
$ finwgt <dbl> 216.7268, 324.9620, 324.9620, 397.1552, 264.874…
$ stratum <chr> "BR3", "BR3", "BR3", "BR3", "BR3", "BR3", "BR3"…
$ Age <fct> 18, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18,…
$ Sex <fct> female, male, male, male, female, female, male,…
$ Grade <fct> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,…
$ ECIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TR…
$ ECIGAR <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, F…
$ ESLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,…
$ EELCIGT <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TR…
$ EROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,…
$ EFLAVCIGTS <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, …
$ EBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ EFLAVCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, …
$ EHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ EPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ ESNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,…
$ EDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ CCIGT <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CCIGAR <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ CSLT <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ CELCIGT <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,…
$ CROLLCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ CFLAVCIGTS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ CBIDIS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ CHOOKAH <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ CPIPE <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ CSNUS <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ CDISSOLV <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ brand_ecig <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ menthol <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ clove_spice <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fruit <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ chocolate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ alcoholic_drink <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ candy_dessert_sweets <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ other <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ no_use <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, …
$ EHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ CHTP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ tobacco_sum_ever <int> 1, 4, 0, 3, 0, 2, 8, 4, 0, 0, 0, 1, 1, 0, 0, 4,…
$ tobacco_sum_current <int> 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ tobacco_ever <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ tobacco_current <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, …
$ ecig_sum_ever <int> 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,…
$ ecig_sum_current <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ non_ecig_sum_ever <int> 1, 3, 0, 2, 0, 1, 7, 3, 0, 0, 0, 0, 1, 0, 0, 3,…
$ non_ecig_sum_current <int> 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ ecig_ever <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TR…
$ ecig_current <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,…
$ non_ecig_ever <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
$ non_ecig_current <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ ecig_only_ever <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ ecig_only_current <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ non_ecig_only_ever <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ non_ecig_only_current <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ Group <chr> "Only other products", "Combination of products…
$ n <int> 17711, 17711, 17711, 17711, 17711, 17711, 17711…
Question 1
Recall that we are interested in investigating how vaping product use has compared with other tobacco use over time. To answer this, we first want to get a sense of how tobacco use has changed in general since 2015.
To create a visualization of how tobacco usage has changed over time, we will first convert the usage data to a percent value for each year, telling us what percent of respondents fall into a particular usage category. To do this we will use the group_by() and summarize() functions of the dplyr package. This will create new variables which we will name Ever and Current based on the percentages of TRUE values for tobacco_ever and tobacco_current for each year. In this case the mean() function is used to calculate the percentages based on an automatic conversion that R does where for TRUE/FALSE variables, TRUE is given a value of one and FALSE is given a value of zero. The mean of a 0-1 binary variable is just the percent of the time the value is 1. NA values do not contribute to the total count when we include the argument na.rm = TRUE to our function call.
Click here to see a toy example:
# A tibble: 10 x 1
var1
<lgl>
1 TRUE
2 TRUE
3 TRUE
4 FALSE
5 FALSE
6 FALSE
7 FALSE
8 FALSE
9 FALSE
10 FALSE
# A tibble: 1 x 1
Percentage
<dbl>
1 30
# A tibble: 10 x 1
var1
<lgl>
1 TRUE
2 TRUE
3 TRUE
4 FALSE
5 FALSE
6 FALSE
7 NA
8 NA
9 NA
10 NA
# A tibble: 1 x 1
Percentage
<dbl>
1 50
And now back to our data:
# A tibble: 5 x 3
year Ever Current
<dbl> <dbl> <dbl>
1 2015 36.8 18.0
2 2016 33.4 14.0
3 2017 31.8 14.4
4 2018 34.7 18.7
5 2019 39.7 22.4
We will use the pivot_longer function to take all columns except year (in this case the Ever and Current columns), to create a column called User that will contain the current column name information and a column called Percentage of students which will contain the mean percentage values that we just calculated. This converts our data into a format called “long” format.
# A tibble: 10 x 3
year User `Percentage of students`
<dbl> <chr> <dbl>
1 2015 Ever 36.8
2 2015 Current 18.0
3 2016 Ever 33.4
4 2016 Current 14.0
5 2017 Ever 31.8
6 2017 Current 14.4
7 2018 Ever 34.7
8 2018 Current 18.7
9 2019 Ever 39.7
10 2019 Current 22.4
You may have noticed that our data is longer than it used to be! Hence the name of the function pivot_longer(). Data is often easier to plot when it is in this format.
Now we will use this data to create a plot using the ggplot2 package.
The first thing we do to create a plot is specify what data we are using for our x axis and y axis with theaes() argument of the ggplot() function. Then we add layers to our plot that specify what type of plot we would like to create. We can use the geom_line() function to create lines for each type of user. We specify that we want to use different line types for each user using aes(). We will also add points to our lines using the geom_point() function. We can add additional layers to specify colors and details about labels and legends etc.
plot1 <- nyts_data %>%
dplyr::group_by(year) %>%
dplyr::summarize(Ever=(mean(tobacco_ever, na.rm = TRUE)*100),
Current=(mean(tobacco_current, na.rm = TRUE)*100))%>%
pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students")%>%
ggplot(aes(x=year,y=`Percentage of students`)) +
geom_line(aes(linetype=User)) +
geom_point(show.legend = FALSE, size = 2) +
# this allows us to choose what type of line we want for each line
scale_linetype_manual(values = c(2,1)) +
# this allows us to specify how the y-axis should appear
scale_y_continuous(breaks = seq(0,70,by=10),
labels = seq(0,70,by=10),
limits = c(0,70)) +
#this adjusts the background style of the plot
theme_linedraw() +
# this moves the legend to the bottom of the plot and removes the x axis title
theme(legend.position = "bottom",
axis.title.x = element_blank()) +
labs(title = "How has tobacco use varied over the years?",
y = "% of students")
plot1

Nice! Now we can see how overall tobacco usage has changed since 2017. It appears that usage first decreased from 2015 to 2017 and then increased a bit since 2017, surpassing the levels in 2015.
What about e-cigarette use? How has the usage of e-cigarettes changed over time?
plot1a <- nyts_data %>%
dplyr::group_by(year) %>%
dplyr::summarize(Ever=(mean(ecig_ever, na.rm = TRUE)*100),
Current=(mean(ecig_current, na.rm = TRUE)*100))%>%
pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students")%>%
ggplot(aes(x=year,y=`Percentage of students`)) +
geom_line(aes(linetype=User)) +
geom_point(show.legend = FALSE, size = 2) +
# this allows us to choose what type of line we want for each line
scale_linetype_manual(values = c(2,1)) +
# this allows us to specify how the y-axis should appear
scale_y_continuous(breaks = seq(0,60,by=10),
labels = seq(0,60,by=10),
limits = c(0,60)) +
#this adjusts the background style of the plot
theme_linedraw() +
# this moves the legend to the bottom of the plot and removes the x axis title
theme(legend.position = "bottom",
axis.title.x = element_blank()) +
labs(title = "How has e-cigarette use varied over the years?",
y = "% of students")
plot1a
It looks like the shape of the plot is very similar to tobacco usage overall. We see a downward trend until 2017 when the rate of both current and ever users increased. Recall that this is in agreement with the articles that we referenced earlier. We can see that the slope looks steeper for e-cigarette usage as compared to all tobacco products (including e-cigarettes).
Now let’s plot this data together on the same plot.
We will have four groups (e-cigarette ever users, e-cigarette current users, tobacco ever users, and tobacco current users) to plot, therefore, it would be useful to add color to our plot. Keep in mind that e-cigarette users are a subset of any tobacco product users.
One important thing to keep in mind when creating plots is that individuals with color blindness may have a difficult time distinguishing groups when certain color choices are used.
One great option is to use the viridis package, which offers color palettes with colors that are still distinguishable by individuals with most forms of color blindness.
We can choose which colors we want to use by using the show_col() function of the scales package.
Here are some color options:

We will select the first and fourth colors for our plot. To add these specific colors we will use the scale_color_manual() function of the ggplot2 package.
We will calculate the mean ever and current usage percentages for students who used e-cigarettes or any tobacco products (including e-cigarettes) for each year again using the group_by() and summarize() functions. We will again use the pivot_longer function to convert our data to long format. We will also use the separate() function of the tidyr package to create two variables from one of the variables. This is done by separating by, in this case, an underscore.
nyts_data %>%
dplyr::group_by(year) %>%
dplyr::summarize("Ever_Any Tobacco Product \n (including e-cigarettes)"=(mean(tobacco_ever, na.rm = TRUE)*100),
"Current_Any Tobacco Product \n (including e-cigarettes)"=(mean(tobacco_current, na.rm = TRUE)*100),
"Ever_E-cigarettes"=(mean(ecig_ever, na.rm = TRUE)*100),
"Current_E-cigarettes"=(mean(ecig_current, na.rm = TRUE)*100))%>%
pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students")%>%
separate(User, into = c("User", "Product"), sep = "_")%>%
head()
# A tibble: 6 x 4
year User Product `Percentage of studen…
<dbl> <chr> <chr> <dbl>
1 2015 Ever "Any Tobacco Product \n (including e-cig… 36.8
2 2015 Current "Any Tobacco Product \n (including e-cig… 18.0
3 2015 Ever "E-cigarettes" 26.4
4 2015 Current "E-cigarettes" 11.0
5 2016 Ever "Any Tobacco Product \n (including e-cig… 33.4
6 2016 Current "Any Tobacco Product \n (including e-cig… 14.0
plot1t <- nyts_data %>%
dplyr::group_by(year) %>%
dplyr::summarize("Ever_Any Tobacco Product \n (including e-cigarettes)"=(mean(tobacco_ever, na.rm = TRUE)*100),
"Current_Any Tobacco Product \n (including e-cigarettes)"=(mean(tobacco_current, na.rm = TRUE)*100),
"Ever_E-cigarettes"=(mean(ecig_ever, na.rm = TRUE)*100),
"Current_E-cigarettes"=(mean(ecig_current, na.rm = TRUE)*100))%>%
pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students")%>%
separate(User, into = c("User", "Product"), sep = "_")%>%
ggplot(aes(x=year,y=`Percentage of students`, color = Product)) +
geom_line(aes(linetype=User)) +
geom_point(show.legend = FALSE, size = 2) +
# this allows us to choose what type of line we want for each line
scale_linetype_manual(values = c(2,1)) +
# we want purple associated with e-cigarettes to be consistent with later plots
scale_color_manual(values = rev(v_colors))+
# this allows us to specify how the y-axis should appear
scale_y_continuous(breaks = seq(0,60,by=10),
labels = seq(0,60,by=10),
limits = c(0,60)) +
#this adjusts the background style of the plot
theme_linedraw() +
# this moves the legend to the bottom of the plot and removes the x axis title
theme(legend.position = "bottom",
axis.title.x = element_blank()) +
labs(title = "How has tobacco use varied over the years?",
y = "% of students")
plot1t

We see an increase in all categories starting in 2017, but the rate of increase is higher for students using only e-cigarettes (current or ever users), as shown by the higher slope of the e-cigarette lines.
In the above plots, the “Any tobacco product” groups include individuals in the “E-cigarette only” groups. Now let’s plot students in two mutually exclusive groups on the same plot: those who reported either using only e-cigarettes or only other tobacco products (besides e-cigarettes), but not both.
We will calculate the mean ever and current usage percentages for students in these two mutually exclusive groups, again using the group_by() function and the summarize() function. We will again use the pivot_longer function to convert our data to long format. We will also again use the separate() function of the tidyr package to create two variables from one variable. This is done by separating by, in this case, a space.
nyts_data %>%
dplyr::group_by(year) %>%
dplyr::summarize("Ever_E-cigarette"=(mean(ecig_only_ever, na.rm = TRUE)*100),
"Current_E-cigarette"=(mean(ecig_only_current, na.rm = TRUE)*100),
"Ever_Non-e-cigarette" =(mean(non_ecig_only_ever, na.rm = TRUE)*100),
"Current_Non-e-cigarette"=(mean(non_ecig_only_current, na.rm = TRUE)*100)) %>%
pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students") %>%
tidyr::separate(User, into = c("User", "Product"), sep = "_") %>%
head()
# A tibble: 6 x 4
year User Product `Percentage of students`
<dbl> <chr> <chr> <dbl>
1 2015 Ever E-cigarette 4.36
2 2015 Current E-cigarette 1.54
3 2015 Ever Non-e-cigarette 7.06
4 2015 Current Non-e-cigarette 3.35
5 2016 Ever E-cigarette 4.54
6 2016 Current E-cigarette 1.23
plot1c <- nyts_data %>%
dplyr::group_by(year) %>%
dplyr::summarize("Ever_E-cigarette"=(mean(ecig_only_ever, na.rm = TRUE)*100),
"Current_E-cigarette"=(mean(ecig_only_current, na.rm = TRUE)*100),
"Ever_Non-e-cigarette" =(mean(non_ecig_only_ever, na.rm = TRUE)*100),
"Current_Non-e-cigarette"=(mean(non_ecig_only_current, na.rm = TRUE)*100))%>%
pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students")%>%
separate(User, into = c("User", "Product"), sep = "_") %>%
ggplot(aes(x=year,y=`Percentage of students`, color = Product)) +
geom_line(aes(linetype=User)) +
geom_point(show.legend = FALSE, size = 2) +
# this allows us to choose what type of line we want for each line
scale_linetype_manual(values = c(2,1)) +
# this allows us to specify how the y-axis should appear
scale_y_continuous(breaks = seq(0,30,by=10),
labels = seq(0,30,by=10),
limits = c(0,30)) +
scale_color_manual(values = v_colors)+
#this adjusts the background style of the plot
theme_linedraw() +
# this moves the legend to the bottom of the plot and removes the x axis title
theme(legend.position = "bottom",
axis.title.x = element_blank()) +
labs(title = "How has use of only e-cigarettes and \n only tobacco products besides e-cigarettes varied over time?",
y = "% of students")
plot1c

Very interesting! We can see from this plot that the percentage of students who had currently used (or ever tried) only e-cigarettes greatly increased starting in 2017, while in contrast the percentage of students who had ever tried only non-e-cigarette tobacco products actually diminished over time. In fact, we can see that in 2019 the percentage of students who were current e-cigarette users surpassed the percentage that had tried a non-e-cigarette product even just once.
Recall that we made a variable called Group that identified students who used either just e-cigarette products, just other tobacco products (besides e-cigarettes), or students who used both e-cigarettes and some other type of tobacco product.
# A tibble: 4 x 2
Group n
<chr> <int>
1 Combination of products 315716730
2 Neither 1183798832
3 Only e-cigarettes 151032421
4 Only other products 179275592
We will now make a plot over time of each of these groups. Since we will have 4 total groups, we will use 4 of the viridis colors. Notice, that in this case we are grouping by three variables by simply separating the variables that we want to group by with a comma in our group_by() function like this: group_by(Group, year, n).
# A tibble: 6 x 5
# Groups: Group, year [6]
Group year n group_count `Percentage of students`
<chr> <dbl> <int> <int> <dbl>
1 Combination of products 2015 17711 3634 20.5
2 Combination of products 2016 20675 3297 15.9
3 Combination of products 2017 17872 2623 14.7
4 Combination of products 2018 20189 3321 16.4
5 Combination of products 2019 19018 3642 19.2
6 Neither 2015 17711 11188 63.2

We can see that the majority of students did not report using any tobacco products. Of the students that did report using tobacco products, the majority of the students used both e-cigarettes and some other tobacco product. Again, a much larger percentage reported using only e-cigarettes rather than only other tobacco products in 2019.
We will further explore the relationship between e-cigarette usage and other tobacco products a bit later in the case study.
Question 2
Now we want to look how e-cigarette smoking rates compare between males and females across time.
We will calculate the percent ever and current e-cigarette users for each year and sex category again using the group_by() function and the summarize() function. We will again use the pivot_longer function to convert our data to long format. Because there are some missing values for the Sex variable, we will filter these out; based on the codebook, there is little information about why these values are missing, so our survey data only allows us to consider two reported sex values with confidence.
AVOCADO: Carrie, I am filtering out the NAs for Sex now; can you check/modify the language I included above? I feel like you put it more eloquently on the phone.
AVOCADO: Also, in this plot title e-cigarette is changed to vaping – is that on purpose? Should we make that change earlier or throughout the case study? Or stick with e-cigarette?
# A tibble: 6 x 4
# Groups: year [2]
year Sex User `Percentage of students`
<dbl> <fct> <chr> <dbl>
1 2015 male Ever 29.3
2 2015 male Current 13.3
3 2015 female Ever 24.3
4 2015 female Current 9.05
5 2016 male Ever 24.1
6 2016 male Current 8.72
plot2 <- nyts_data %>%
filter(!is.na(Sex)) %>%
group_by(year, Sex) %>%
summarize(Ever=(mean(EELCIGT, na.rm = TRUE)*100),
Current=(mean(CELCIGT, na.rm = TRUE)*100))%>%
# filter(!is.na(Sex)) %>%
pivot_longer(cols = Ever:Current,
names_to = "User",
values_to = "Percentage of students") %>%
ggplot(aes(x = year,y =`Percentage of students`, color = Sex)) +
geom_line(aes(linetype = User)) +
geom_point(show.legend = FALSE, size = 2) +
scale_linetype_manual(values = c(2,1)) +
scale_color_manual(values =v_colors)+
theme_linedraw() +
theme(legend.position = "bottom",
# axis.text.x = element_text(angle = 90),
axis.title.x = element_blank()) +
labs(title = "How do vaping rates compare between males and females?",
subtitle = "Current and ever users by sex",
y = "% of students")
plot2

It looks like the rates are fairly similar between the sexes, however the rate for males appears to be consistently higher across time.
Question 3
We are also interested in what vaping brands and flavors appear to be used the most frequently. Only the 2019 data set has this information. Therefore, we will filter for just this year using the filter() function of the dplyr package. We will use the summarize() function slightly differently this time, to calculate the total number of students using each brand using the n() function and the sum() function to calculate the percent for each brand based on the counts. We will also reorder the factor levels for the brand names so that they are in descending order of percent use, using the fct_reorder() function from dplyr. This will make them appear in decreasing order of percent use on the plot.
We will make a bar plot this time by using geom_bar. Importantly we assign the stat argument to identity, so that we are using the percentages that we calculated not the counts which is what is used by default. When color in specified outside of the aes() argument, this determines the border color of the bars, which in this case will be black.
# A tibble: 7 x 4
brand_ecig n total Percent
<fct> <int> <int> <dbl>
1 Blu 111 3604 3.08
2 JUUL 2028 3604 56.3
3 Logic 36 3604 0.999
4 MarkTen 32 3604 0.888
5 NJOY 44 3604 1.22
6 Other 1253 3604 34.8
7 Vuse 100 3604 2.77

Juul appears to be the most widely used brand. This is in agreement with a recent article, whose most recent data was from 2017:
We are also interested in how the usage of different flavors has changed over time.
To evaluate this we will calculate the percentage of students using each flavor each year - this includes usage of any type of flavored tobacco product. We will exclude 2015 data, as no specific flavor questions were asked at that time.
Recall that NA values are not included in calculating the total count for our percentages. However all of these flavor questions had complete reporting and did not have NA values. Therefore, these values reflect the percentage of students reporting using a particular favor out of all students surveyed (including those that did not use any tobacco products). Also students were allowed to select more than one flavor. You can see whether these variables had complete reporting by checking the NA values using the base summary function. Alternatively you can create a visual representation using the vis_miss() function of the naniar package.

The plot above confirms that these variables have no NA values (because all fields indicate 100% of data is present).
plot4 <- nyts_data %>%
filter(year!=2015) %>%
group_by(year) %>%
summarize(Menthol = (mean(menthol)*100),
`Clove or Spice` = (mean(clove_spice)*100),
Fruit = (mean(fruit)*100),
Chocolate = (mean(chocolate)*100),
`Alcoholic Drink` = (mean(alcoholic_drink)*100),
`Candy/Desserts/Sweets` = (mean(candy_dessert_sweets)*100),
Other = (mean(other)*100)) %>%
pivot_longer(cols = - year, names_to = "Flavor", values_to = "Percentage of students") %>%
rename(Year = year) %>%
ggplot(aes(y = `Percentage of students`,
x = Year,
fill = reorder(Flavor, `Percentage of students`)))+
geom_bar(stat = "identity", position = "dodge", color = "black")+
scale_fill_viridis(discrete=TRUE)+
theme_linedraw() +
guides(fill=guide_legend("Flavor"))+
labs(title = "What flavors appear to be used the most frequently?",
subtitle = "Flavors of tobacco products used in the past 30 days")
plot4

From this plot, we can see that fruit flavors are the most widely used products, followed by menthol or mint flavored products. We can also see that there was a general increase in the usage of flavored products over time.
We will now look specifically at the usage of flavored e-cigarette products vs other flavored tobacco products.
Recall that we made a variable called Group that identified students who used either just e-cigarette/vaping products, just other tobacco products (besides e-cigarettes), or students who used both e-cigarettes and some other type of tobacco product. We will compare the usage of these flavors for these different groups. We also perform some data summaries to decide how to order the panels (flavors) for display.
v_colors = viridis(5)[1:4]
plot5 <- nyts_data %>%
filter(year != 2015) %>%
group_by(year, Group) %>%
summarize(Menthol = (mean(menthol)*100),
`Clove or Spice` = (mean(clove_spice)*100),
Fruit = (mean(fruit)*100),
Chocolate = (mean(chocolate)*100),
`Alcoholic Drink` = (mean(alcoholic_drink)*100),
`Candy/Desserts/\nSweets` = (mean(candy_dessert_sweets)*100),
Other = (mean(other)*100),
Respondents = n()) %>%
#converting all columns between and including Menthol and Other to one column called Flavor
pivot_longer(cols = Menthol:Other, names_to = "Flavor", values_to = "Percentage of students") %>%
group_by(Flavor) %>%
# calculate the count of students in the year/group combination who used that flavor
mutate(affirmative=(Respondents * `Percentage of students`)/100) %>%
# calculate the fraction of total respondents who used that flavor
mutate(flavor_mean = sum(affirmative)/sum(Respondents)) %>%
ungroup() %>%
# reorder the levels of Flavor to be in increasing order of percent of students who used that flavor
mutate(flavor_mean_rank = dense_rank(flavor_mean),
Flavor = fct_reorder(Flavor, flavor_mean_rank)) %>%
ggplot(aes(x=year, y=`Percentage of students`, color=Group)) +
facet_grid(~Flavor)+
geom_line() +
geom_point(show.legend = FALSE, size = 2) +
scale_color_manual(values = v_colors) +
theme_linedraw() +
theme(legend.position = "bottom",
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90),
strip.text.x = element_text(size = 10, face = "bold")) +
labs(title = "Among different product users, what flavors are most frequently used?")
plot5

We can see from this plot that there has been an increase in the number of students reporting using flavored tobacco products. Users who use both e-cigarettes and other tobacco products appear to report using flavored products the most, followed by users who only use e-cigarettes.
Question 4
Is there a relationship between vaping rates and tobacco use? Now we will investigate the usage of e-cigarettes compared to other tobacco products in greater depth.
First let’s take a look at how e-cigarette usage and cigarette usage compare. We will select the data that specifically has to do with these products.
v_colors = viridis(6)[c(1,4)]
nyts_data %>%
group_by(year) %>%
summarize("Cigarettes, Ever" = (mean(ECIGT, na.rm = TRUE)*100),
"E-cigarettes, Ever" = (mean(EELCIGT, na.rm = TRUE)*100),
"Cigarettes, Current" = (mean(CCIGT, na.rm = TRUE)*100),
"E-cigarettes, Current" = (mean(CELCIGT, na.rm = TRUE)*100)) %>%
pivot_longer(cols= -year, names_to = "Category", values_to = "Percentage of students") %>%
separate( Category, into= c("Product", "User"), sep = ", ") %>%
head()
# A tibble: 6 x 4
year Product User `Percentage of students`
<dbl> <chr> <chr> <dbl>
1 2015 Cigarettes Ever 21.3
2 2015 E-cigarettes Ever 26.9
3 2015 Cigarettes Current 6.23
4 2015 E-cigarettes Current 11.2
5 2016 Cigarettes Ever 19.1
6 2016 E-cigarettes Ever 22.1
plot6 <- nyts_data %>%
group_by(year) %>%
summarize("Cigarettes, Ever" = (mean(ECIGT, na.rm = TRUE)*100),
"E-cigarettes, Ever" = (mean(EELCIGT, na.rm = TRUE)*100),
"Cigarettes, Current" = (mean(CCIGT, na.rm = TRUE)*100),
"E-cigarettes, Current" = (mean(CELCIGT, na.rm = TRUE)*100)) %>%
pivot_longer(cols= -year, names_to = "Category", values_to = "Percentage of students")%>%
separate( Category, into= c("Product", "User"), sep = ", ") %>%
ggplot(aes(x=year,y=`Percentage of students`, color = Product, linetype = User)) +
geom_line() +
geom_point(show.legend = FALSE, size = 2) +
scale_linetype_manual(values = c(2,1)) +
scale_color_manual(values = v_colors) +
theme_linedraw() +
theme(legend.position = "bottom",
axis.title.x = element_blank()) +
labs(title = "How does e-cigarette use compare to cigarette use?",
subtitle = "Current and ever users of e-cigarettes and cigarettes",
y = "% of students")
plot6

Interesting! we can see that in 2019 the percentage of students that reported currently using e-cigarettes had surpassed those that ever tried (even just once) a cigarette. Overall cigarette usage appears to be declining over time. This is not the case for e-cigarettes.
Now we will look at students who reported that they had ever tried e-cigarettes or non-cigarette products. In this case we will not separate out users who specifically only used one or the other. Therefore, the students included in this plot who reported as having ever tried e-cigarettes might also be current users of non-e-cigarette products or may have at least tried non-e-cigarette products.
v_colors = viridis(6)[c(1,4)]
plot7 <- nyts_data %>%
group_by(year) %>%
summarize(`e-cigarette_ever`= (mean(ecig_ever, na.rm = TRUE)*100),
`non-e-cigarette_ever`= (mean(non_ecig_ever, na.rm = TRUE)*100)) %>%
pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
separate(Category, into = c("Product", "User"), sep = "_") %>%
ggplot(aes(x=year,y=`Percentage of students`, color=Product)) +
geom_line() +
geom_point(show.legend = FALSE, size = 2) +
scale_color_manual(values = v_colors) +
scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
theme_linedraw() +
theme(legend.position = "bottom",
axis.title.x = element_blank()) +
labs(title = "How does the rate of ever trying e-cigarettes \ncompare to ever trying other products over time?",
y = "% of students")
plot7

Now we will do the same, but for students who reported currently using e-cigarettes or non-e-cigarette products.
v_colors = viridis(6)[c(1,4)]
plot8 <- nyts_data %>%
group_by(year) %>%
summarize(`e-cigarette_current`= (mean(ecig_current, na.rm = TRUE)*100),
`non-e-cigarette_current`= (mean(non_ecig_current, na.rm = TRUE)*100)) %>%
pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
separate(Category, into = c("Product", "User"), sep = "_") %>%
ggplot(aes(x=year, y=`Percentage of students`, color=Product)) +
geom_line(linetype = "dashed") +
geom_point(show.legend = FALSE, size = 2) +
scale_color_manual(values = v_colors) +
scale_linetype_manual(values = c(1)) +
scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
theme_linedraw() +
theme(legend.position = "bottom",
axis.title.x = element_blank()) +
labs(title = "How does the rate of currently using e-cigarettes \ncompare to currently using other products over time?",
y = "% of students")
plot8

Putting plots together
Now we will put these plots together using the plot_grid() function of the cowplot package. We will also modify the labels using the ggdraw() function, which is also part of the cowplot package.
plotA_uw <- plot1 +
theme(axis.title.x = element_blank(),
legend.position = "none") +
labs(title = "Tobacco product users more prevalent after 2017",
subtitle = NULL,
y = "% of students")
plotB_uw <- plot7 +
theme(axis.title.x = element_blank(),
legend.position = "none") +
labs(title = "% Ever trying e-cigarettes increases &\n% Ever trying other products decreases",
subtitle = NULL,
y = "% of students")
plotC_uw <- plot8 +
theme(axis.title.x = element_blank(),
legend.position = "none") +
labs(title = "% Currently using e-cigarettes increases &\n% Currently using other products decreases",
subtitle = NULL,
y = "% of students")
title_uw <- ggdraw() +
draw_label(
"Have e-cigarette rates possibly influenced tobacco use?",
fontface = 'bold',
size=14,
x = 0,
hjust = 0
) +
theme(
plot.margin = margin(0, 0, 0, 0)
)
plotsA_uw <- plot_grid(plotA_uw,
rel_widths = c(1,1))
plotsBC_uw <- plot_grid(plotB_uw,
plotC_uw,
rel_widths = c(1,1))
# this will take the legend from plot1c to use as the legend for the plot we are creating
legend_uw <- get_legend(plot1c +
theme(legend.position = "bottom",
legend.direction = "horizontal"))
figure_uw <- plot_grid(title_uw,
plotsA_uw,
plotsBC_uw,
legend_uw,
ncol = 1,
rel_heights = c(0.1,
1,
1,
0.1),
scale = 1.0)
figure_uw

Survey Weighting
It turns out that our analysis thus far has been brushing an important statistical concept under the rug, related to how our data were collected. Our data come from responses to a survey, which may have a particular sampling scheme to capture data about the population we are interested in. For example, the survey may be designed to capture a set of individuals who reflect the characteristics of the population that we are interested in drawing conclusions about. However, only a fraction of the individuals who were contacted about taking the survey may have completed it, and this fraction of individuals may no longer be representative of the population. Or the survey may be designed to over-sample a particular group of interest so that individuals from that group show up more often as survey respondents than are present in the population overall. In order to account for the fact that the survey respondents may not reflect the composition of the population we want to generalize to, we can emply a technique called survey weighting.
Survey weighting is a common technique used in survey data analysis because often the individuals that take a survey are not necessarily representative of the population that we are trying to gather information about. For example, we may have more females that respond to the survey than males because perhaps female students were more willing to participate. In this case, the proportion of data values in our data will be smaller for the males than the proportion of actual male students and larger for the females than the true proportion of actual female students. To get a better estimate of overall e-cigarette smoking rates, the data from the males can be weighted based on the true proportion of male students to amplify the contribution of the responses from the males that did participate. Conversely, the female data can be weighted to diminish the contribution if their responses to the overall picture. We will see if using survey weighting changes the general trends that we see in our data.
Calculating survey weights involves making a weight based on the ratio of the proportion of survey respondents from a particular group and the actual proportion of that group in the population. For example, let’s say that females account for 50% of the population and males account for 50 % of the population. Let’s also say that 75% of the respondents to the survey were female and only 25% were males.
Then we could calculate survey weights using this formula:
\[ \frac{\text{actual proportion of group in the population}}{\text{ proportion of group in the respondents}}\]
Thus the weight for the females would be calculated as:
\[ \frac{.5}{.75} = .67\]
The weight for the males would be calculated as:
\[ \frac{.5}{.25} = 2\]
Therefore each male response value would be multiplied by a factor of 2 and would have twice the contribution, while the female response values would have only about 70% of the contribution that they would have had without weighting.
Note that survey weights are in reality corrected for other aspects - for example the response rate to individual questions.
We do not need to calculate survey weights for our data as they were already supplied in the data set, as described in the codebooks.
srvyr package and survey design
We will now use the srvyr package to evaluate our data using survey weights that were provided in the data for each year, as described in the respective codebooks. This package contains functions that allow the user to easily perform calculations from the data that take the survey design into account, without having to work out the math by hand.
Within the data you will see that we have three variables related to the survey sampling scheme: psu, finwgt, and stratum. Details about these variables are available, for example, in the 2019 Methodology Report.
In brief they represent:
psu: Surveys like the one used to create the data we are using often sample people based on strata. This is done to ensure that the responses are representative of the population of interest. Thus, often people first think about ensuring that surveys are conducted in a variety of geographical areas. This is often called the primary sampling unit or PSU. In this survey, the county where the student’s school was located was used as the PSU.
stratum: A categorical variable that indicates subsets of the data that include respondents from different PSUs. In our case, strata are determined by the predominant minority in the PSU (Non-Hispanic Black or Hispanic), whether the PSU is urban or non-urban, and what percent of the students in the PSU fall into the predominant minority group. PSUs are allocated across the 16 possible strata according to the sampling scheme. These strata values allow estimates based on the survey responses to be calculated using different strata allowing for improved precision of the response estimates.
AVOCADO: I added a bunch of details here to the definition of stratum - perhaps give it a quick read. I also think it might be worth including one of the Methodology Reports, which is where I pulled this information from.
finwgt: The survey weight which was calculated based on a variety of factors.
This link and this link have more information about the study design of the data that we are using.
For detailed information on such survey designs in general see here and here.
We will use the as_survey_design() function of the srvyrpackage to create a survey object with a specified survey design. This is a special R object that includes information about how the survey was conducted that can be taken into account in the analysis.
There are several arguments to pay attention to:
- The
strata argument is used to specify the variable(s) that defined strata in the data. In this case, we will use the stratum variable.
- The
ids argument is used to define cluster ids within the data. In this case we will use the psu variable.
- The
weight argument is the used to define which variable(s) are the survey weights.
- The
nest = TRUE argument, forces cluster ids (in this case the PSU) to be nested within the strata.
We can then use the survey_mean() function to calculate percentages of students who report using tobacco for each year while accounting for the survey design and weights. We will specify that we want confidence interval estimates by using the vartype = "ci" argument. The confidence intervals in our case give a range of possible values for the true population mean based on the data observed in the survey. We will multiply these values by 100 to get percentages. (Note: We could also have calculated confidence intervals for the unweighted results above by computing them by hand; we leave this as a potential exercise.)
Since the survey weights are specific to a single year of the survey results, we need to create survey design objects for each year separately. We will use group_by and group_modify, which is also from the dplyr package, to do this. We first write the function that we want to call on each group.
This function will take an input called currYear, which will be one set of survey responses for a specific year, and then craetes a survey design based on the stratum and finwgt specific to that year. It will then calculate the mean for student respondants who have ever tried any tobacco products or are a current user of any tobacco products accounting for the survey design and weights using survey_mean() as was just described. The function will then wrangle the data to produce new variables about the type of mean estimate that was given from the survey_mean output and the type of user. This
Weighted Sample
# A tibble: 5 x 7
# Groups: year [5]
year tobacco_ever tobacco_ever_low tobacco_ever_upp tobacco_current
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2015 0.372 0.344 0.400 0.180
2 2016 0.338 0.319 0.358 0.148
3 2017 0.307 0.284 0.330 0.139
4 2018 0.339 0.318 0.360 0.185
5 2019 0.408 0.384 0.433 0.233
# … with 2 more variables: tobacco_current_low <dbl>, tobacco_current_upp <dbl>
Now let’s make the function wrangle the output in a more usable form too:
surveyMeanA <- function(currYear){
options(survey.lonely.psu = "average")
currYear %>%
as_survey_design(strata = stratum,
ids = psu,
weight = finwgt,
nest = TRUE) %>%
summarize(tobacco_ever = survey_mean(tobacco_ever,
vartype = "ci",
na.rm = TRUE),
tobacco_current = survey_mean(tobacco_current,
vartype = "ci",
na.rm = TRUE)) %>%
mutate_all( "*", 100) %>%
pivot_longer(everything(),
names_to = "Type",
values_to = "Percentage of students") %>%
mutate(Estimate = case_when(str_detect(Type,"_low") ~ "Lower",
grepl("_upp", Type) ~ "Upper",
TRUE ~ "Mean"),
User = case_when(grepl("ever", Type) ~ "Ever",
grepl("current", Type) ~ "Current",
TRUE ~ "Mean"))
}
nyts_data %>%
group_by(year) %>%
group_modify(~surveyMeanA(.x))
# A tibble: 30 x 5
# Groups: year [5]
year Type `Percentage of students` Estimate User
<dbl> <chr> <dbl> <chr> <chr>
1 2015 tobacco_ever 37.2 Mean Ever
2 2015 tobacco_ever_low 34.4 Lower Ever
3 2015 tobacco_ever_upp 40.0 Upper Ever
4 2015 tobacco_current 18.0 Mean Current
5 2015 tobacco_current_low 16.2 Lower Current
6 2015 tobacco_current_upp 19.9 Upper Current
7 2016 tobacco_ever 33.8 Mean Ever
8 2016 tobacco_ever_low 31.9 Lower Ever
9 2016 tobacco_ever_upp 35.8 Upper Ever
10 2016 tobacco_current 14.8 Mean Current
# … with 20 more rows
### AVOCADO: Remove rest of chunk from here if you are OK with this way of doing it
nyts_data %>%
as_survey_design(strata = stratum,
ids = psu,
weight = finwgt,
nest = TRUE) %>%
group_by(year) %>%
summarize(tobacco_ever = survey_mean(tobacco_ever,
vartype = "ci",
na.rm = TRUE),
tobacco_current = survey_mean(tobacco_current,
vartype = "ci",
na.rm = TRUE)) %>%
mutate_at(vars(-year), "*", 100) %>%
pivot_longer(cols = -year,
names_to = "Type",
values_to = "Percentage of students") %>%
mutate(Estimate = case_when(grepl("_low", Type) ~ "Lower",
grepl("_upp", Type) ~ "Upper",
TRUE ~ "Mean"),
User = case_when(grepl("ever", Type) ~ "Ever",
grepl("current", Type) ~ "Current",
TRUE ~ "Mean")) %>%
head()
# A tibble: 6 x 5
year Type `Percentage of students` Estimate User
<dbl> <chr> <dbl> <chr> <chr>
1 2015 tobacco_ever 37.2 Mean Ever
2 2015 tobacco_ever_low 34.4 Lower Ever
3 2015 tobacco_ever_upp 40.0 Upper Ever
4 2015 tobacco_current 18.0 Mean Current
5 2015 tobacco_current_low 16.2 Lower Current
6 2015 tobacco_current_upp 19.8 Upper Current
We will now make a plot using this data. The confidence intervals are included using the geom_linerange() function of the ggplot2 package.
### AVOCADO: Remove this first version of the figure if you are OK with my modifications below
plotA_w <- nyts_data %>%
as_survey_design(strata = stratum,
ids = psu,
weight = finwgt,
nest = TRUE) %>%
group_by(year) %>%
summarize(tobacco_ever = survey_mean(tobacco_ever,
vartype = "ci",
na.rm = TRUE),
tobacco_current = survey_mean(tobacco_current,
vartype = "ci",
na.rm = TRUE)) %>%
mutate_at(vars(-year), "*", 100) %>%
pivot_longer(cols = -year,
names_to = "Type",
values_to = "Percentage of students") %>%
mutate(Estimate = case_when(grepl("_low", Type) ~ "Lower",
grepl("_upp", Type) ~ "Upper",
TRUE ~ "Mean"),
User = case_when(grepl("ever", Type) ~ "Ever",
grepl("current", Type) ~ "Current",
TRUE ~ "Mean")) %>%
dplyr::select(-Type) %>%
pivot_wider(names_from = Estimate,
values_from = `Percentage of students`) %>%
ggplot(aes(x=year,y=Mean)) +
geom_line(aes(linetype=User)) +
geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1, show.legend = FALSE) +
scale_linetype_manual(values = c(2,1)) +
scale_color_manual(values = v_colors) +
scale_y_continuous(breaks = seq(0,70,by=10),
labels = seq(0,70,by=10),
limits = c(0,70)) +
theme_linedraw() +
theme(legend.position = "none",
axis.title.x = element_blank()) +
labs(title = "Tobacco product users more prevalent after 2017",
y = "% of students")
plotA_w

## AVOCADO: updated version using group_modify
plotA_w <- nyts_data %>%
group_by(year) %>%
group_modify(~surveyMeanA(.x)) %>%
dplyr::select(-Type) %>%
pivot_wider(names_from = Estimate,
values_from = `Percentage of students`) %>%
ggplot(aes(x=year,y=Mean)) +
geom_line(aes(linetype=User)) +
geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1, show.legend = FALSE) +
scale_linetype_manual(values = c(2,1)) +
scale_color_manual(values = v_colors) +
scale_y_continuous(breaks = seq(0,70,by=10),
labels = seq(0,70,by=10),
limits = c(0,70)) +
theme_linedraw() +
theme(legend.position = "none",
axis.title.x = element_blank()) +
labs(title = "Tobacco product users more prevalent after 2017",
y = "% of students")
plotA_w

Now we can see that we have confidence interval ranges plotted for each value.
We will make a similar plot for students who reported ever trying or who currently use e-cigarettes as opposed to tobacco in general.
surveyMeanB <- function(currYear){
options(survey.lonely.psu = "average")
currYear %>%
as_survey_design(strata = stratum,
ids = psu,
weight = finwgt,
nest = TRUE) %>%
summarize(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE)) %>%
mutate_all("*", 100) %>%
pivot_longer(everything(),
names_to = "Category",
values_to = "Percentage of students") %>%
mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
grepl("_upp", Category) ~ "Upper",
TRUE ~ "Mean"),
User = case_when(grepl("ever", Category) ~ "Ever",
grepl("current", Category) ~ "Current"),
Product = case_when(grepl("non_ecig", Category) ~ "Other products",
TRUE ~ "E-cigarettes")) %>%
dplyr::select(-Category) %>%
pivot_wider(names_from = Estimate,
values_from = `Percentage of students`)
}
nyts_data %>%
group_by(year) %>%
group_modify(~surveyMeanB(.x)) %>%
head()
# A tibble: 6 x 6
# Groups: year [3]
year User Product Mean Lower Upper
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2015 Ever E-cigarettes 26.6 24.3 29.0
2 2015 Ever Other products 31.3 28.7 33.8
3 2016 Ever E-cigarettes 22.6 21.0 24.3
4 2016 Ever Other products 28.2 26.2 30.2
5 2017 Ever E-cigarettes 21.1 19.1 23.2
6 2017 Ever Other products 24.3 22.2 26.4
## AVOCADO: Drop this version of code
nyts_data %>%
as_survey_design(strata = stratum, ids = psu, weight = finwgt, nest=TRUE) %>%
group_by(year) %>%
summarize(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE)) %>% mutate_at(vars(-year), "*", 100) %>%
pivot_longer(cols = -year,
names_to = "Category",
values_to = "Percentage of students") %>%
mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
grepl("_upp", Category) ~ "Upper",
TRUE ~ "Mean"),
User = case_when(grepl("ever", Category) ~ "Ever",
grepl("current", Category) ~ "Current"),
Product = case_when(grepl("non_ecig", Category) ~ "Other products",
TRUE ~ "E-cigarettes")) %>%
dplyr::select(-Category) %>%
pivot_wider(names_from = Estimate,
values_from = `Percentage of students`) %>%
head()
# A tibble: 6 x 6
year User Product Mean Lower Upper
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2015 Ever E-cigarettes 26.6 24.4 28.9
2 2015 Ever Other products 31.3 28.6 33.9
3 2016 Ever E-cigarettes 22.6 20.9 24.3
4 2016 Ever Other products 28.2 26.1 30.3
5 2017 Ever E-cigarettes 21.1 19.1 23.1
6 2017 Ever Other products 24.3 22.1 26.5
## AVOCADO: Drop this version of plot
plotB_w <- nyts_data %>%
as_survey_design(strata = stratum, ids = psu, weight = finwgt, nest=TRUE) %>%
group_by(year) %>%
summarize(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE)) %>% mutate_at(vars(-year), "*", 100) %>%
pivot_longer(cols = -year,
names_to = "Category",
values_to = "Percentage of students") %>%
mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
grepl("_upp", Category) ~ "Upper",
TRUE ~ "Mean"),
User = case_when(grepl("ever", Category) ~ "Ever",
grepl("current", Category) ~ "Current"),
Product = case_when(grepl("non_ecig", Category) ~ "Other products",
TRUE ~ "E-cigarettes")) %>%
dplyr::select(-Category) %>%
pivot_wider(names_from = Estimate,
values_from = `Percentage of students`) %>%
ggplot(aes(x=year,y=Mean, color = Product)) +
geom_line() +
geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1, show.legend = FALSE) +
scale_linetype_manual(values = c(2,1)) +
scale_color_manual(values = v_colors) +
scale_y_continuous(breaks = seq(0,60,by=10),
labels = seq(0,60,by=10),
limits = c(0,60)) +
theme_linedraw() +
theme(legend.position = "none",
axis.title.x = element_blank()) +
labs(title = "% Ever trying e-cigarettes increases &\n% Ever trying other products decreases",
y = "% of students")
plotB_w

plotB_w <- nyts_data %>%
group_by(year) %>%
group_modify(~surveyMeanB(.x)) %>%
ggplot(aes(x=year,y=Mean, color = Product)) +
geom_line() +
geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1, show.legend = FALSE) +
scale_linetype_manual(values = c(2,1)) +
scale_color_manual(values = v_colors) +
scale_y_continuous(breaks = seq(0,60,by=10),
labels = seq(0,60,by=10),
limits = c(0,60)) +
theme_linedraw() +
theme(legend.position = "none",
axis.title.x = element_blank()) +
labs(title = "% Ever trying e-cigarettes increases &\n% Ever trying other products decreases",
y = "% of students")
plotB_w

Now we will do the same but for current users:
surveyMeanC <- function(currYear){
options(survey.lonely.psu = "average")
currYear %>%
as_survey_design(strata = stratum,
ids = psu,
weight = finwgt,
nest = TRUE) %>%
summarize(ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
mutate_all("*", 100) %>%
pivot_longer(everything(),
names_to = "Category",
values_to = "Percentage of students") %>%
mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
grepl("_upp", Category) ~ "Upper",
TRUE ~ "Mean"),
User = case_when(grepl("ever", Category) ~ "Ever",
grepl("current", Category) ~ "Current"),
Product = case_when(grepl("non_ecig", Category) ~ "Other products",
TRUE ~ "E-cigarettes")) %>%
dplyr::select(-Category) %>%
pivot_wider(names_from = Estimate,
values_from = `Percentage of students`)
}
#AVOCADO: Drop this version of plot code
plotC_w <- nyts_data %>%
as_survey_design(strata = stratum, ids = psu, weight = finwgt, nest=TRUE) %>%
group_by(year) %>%
summarize(ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
mutate_at(vars(-year), "*", 100) %>%
pivot_longer(cols = -year,
names_to = "Category",
values_to = "Percentage of students") %>%
mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
grepl("_upp", Category) ~ "Upper",
TRUE ~ "Mean"),
User = case_when(grepl("ever", Category) ~ "Ever",
grepl("current", Category) ~ "Current"),
Product = case_when(grepl("non_ecig", Category) ~ "Other products",
TRUE ~ "E-cigarettes")) %>%
dplyr::select(-Category) %>%
pivot_wider(names_from = Estimate,
values_from = `Percentage of students`) %>%
ggplot(aes(x=year,y=Mean, color=Product)) +
geom_line(aes(linetype="dashed")) +
geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1, show.legend = FALSE) +
scale_linetype_manual(values = c(2,1)) +
scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
scale_color_manual(values = v_colors) +
theme_linedraw() +
theme(legend.position = "none",
axis.title.x = element_blank()) +
labs(title = "% Currently using e-cigarettes increases &\n% Currently using other products decreases",
y = "% of students")
plotC_w

plotC_w <- nyts_data %>%
group_by(year) %>%
group_modify(~surveyMeanC(.x)) %>%
ggplot(aes(x=year,y=Mean, color=Product)) +
geom_line(aes(linetype="dashed")) +
geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1, show.legend = FALSE) +
scale_linetype_manual(values = c(2,1)) +
scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
scale_color_manual(values = v_colors) +
theme_linedraw() +
theme(legend.position = "none",
axis.title.x = element_blank()) +
labs(title = "% Currently using e-cigarettes increases &\n% Currently using other products decreases",
y = "% of students")
plotC_w

Now we will put these plots together again using the cowplot package:
title_w <- ggdraw() +
draw_label(
expression("What is the relationship between vaping rates and tobacco use?"),
fontface = 'bold',
size=14,
x = 0,
hjust = 0
) +
theme(
plot.margin = margin(0, 0, 0, 0)
)
plotsA_w <- plot_grid(plotA_w,
rel_widths = c(1),
align = "v",
axis = "bt")
plotsBC_w <- plot_grid(plotB_w,
plotC_w,
rel_widths = c(1,1),
align = "v",
axis = "bt")
legend_w <- get_legend(plot1c +
theme(legend.position = "bottom",
legend.direction = "horizontal"))
figure_w <- plot_grid(title_w,
plotsA_w,
plotsBC_w,
legend_w,
ncol = 1,
rel_heights = c(0.1,
1,
1,
0.1),
scale = 1.0)
figure_w

We can see that these figures look quite similar to the ones generated without using the survey weights.
Artificial Cohort
Although the survey design does not allow specific individuals to be followed over time, we will use certain subsets of the data from each year to construct an artificial cohort where we follow students of the same age group as they get older. This will allow us to look at how tobacco usage changed for students who were in 8th grade in 2015 as they aged.
All of the data so far has included all 6th-12th graders every year. Now we will look at just the data for students expected to graduate in 2019. These are the students who were in 8th grade in 2015, most of whom were 9th graders in 2016, 10th graders in 2017 and so on. We will filter the data to just the students expected to be in the graduating class of 2019.
surveyMeanCohort <- function(currYear){
options(survey.lonely.psu = "average")
currYear %>%
as_survey_design(strata = stratum,
ids = psu,
weight = finwgt,
nest = TRUE) %>%
summarize(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE),
tobacco_ever_year = survey_mean(tobacco_ever, vartype = "ci", na.rm=TRUE),
tobacco_current_year = survey_mean(tobacco_current, vartype = "ci", na.rm=TRUE))%>%
mutate_all("*", 100) %>%
pivot_longer(everything(),
names_to = "Category",
values_to = "Percentage of students") %>%
mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
grepl("_upp", Category) ~ "Upper",
TRUE ~ "Mean"),
User = case_when(grepl("ever", Category) ~ "Ever",
grepl("current", Category) ~ "Current"),
Product = case_when(grepl("non_ecig", Category) ~ "Other products",
grepl("tobacco", Category) ~ "Any tobacco product",
TRUE ~ "E-cigarettes")) %>%
dplyr::select(-Category) %>%
pivot_wider(names_from = Estimate,
values_from = `Percentage of students`)
}
Cohort_data <-nyts_data %>%
filter((Grade == "8" & year == 2015) |
(Grade == "9" & year == 2016) |
(Grade == "10" & year == 2017) |
(Grade == "11" & year == 2018) |
(Grade == "12" & year == 2019)
) %>%
as_survey_design(strata = stratum, ids = psu, weight = finwgt) %>%
group_by(year) %>%
summarize(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE),
tobacco_ever_year = survey_mean(tobacco_ever, vartype = "ci", na.rm=TRUE),
tobacco_current_year = survey_mean(tobacco_current, vartype = "ci", na.rm=TRUE))%>%
mutate_at(vars(-year), "*", 100) %>%
pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students")%>%
mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
grepl("_upp", Category) ~ "Upper",
TRUE ~ "Mean"),
User = case_when(grepl("ever", Category) ~ "Ever",
grepl("current", Category) ~ "Current"),
Product = case_when(grepl("non_ecig", Category) ~ "Other products",
grepl("tobacco", Category) ~ "Any tobacco product",
TRUE ~ "E-cigarettes")) %>%
dplyr::select(-Category) %>%
pivot_wider(names_from = Estimate,
values_from = `Percentage of students`)
head(Cohort_data)
# A tibble: 6 x 6
year User Product Mean Lower Upper
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2015 Ever E-cigarettes 20.1 17.1 23.0
2 2015 Current E-cigarettes 8.12 6.72 9.52
3 2015 Ever Other products 20.7 17.9 23.6
4 2015 Current Other products 7.06 5.66 8.45
5 2015 Ever Any tobacco product 26.9 23.5 30.3
6 2015 Current Any tobacco product 10.9 9.11 12.6
# A tibble: 6 x 6
# Groups: year [1]
year User Product Mean Lower Upper
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2015 Ever E-cigarettes 20.1 16.8 23.3
2 2015 Current E-cigarettes 8.12 6.65 9.59
3 2015 Ever Other products 20.7 17.5 23.9
4 2015 Current Other products 7.06 5.53 8.58
5 2015 Ever Any tobacco product 26.9 23.1 30.7
6 2015 Current Any tobacco product 10.9 9.01 12.7
We will now make similar plots to those above for this subset of the data:
plotA_w_8 <-Cohort_data %>%
filter(Product == "Any tobacco product") %>%
ggplot(aes(x=year,y=Mean)) +
geom_line(aes(linetype=User)) +
geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1) +
scale_linetype_manual(values = c(2,1)) +
scale_y_continuous(breaks = seq(0,70,by=10),
labels = seq(0,70,by=10),
limits = c(0,70)) +
scale_color_manual(values = v_colors) +
theme_linedraw() +
theme(legend.position = "none",
axis.title.x = element_blank()) +
labs(title = "Tobacco product use became increasingly prevalent",
y = "% of students")
plotB_w_8 <-Cohort_data %>%
filter(Product != "Any tobacco product", User == "Ever") %>%
ggplot(aes(x=year,y=Mean, color=Product)) +
geom_line(linetype=1) +
geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1) +
scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
scale_color_manual(values = v_colors) +
theme_linedraw() +
theme(legend.position = "none",
axis.title.x = element_blank()) +
labs(title = "% ever trying tobacco products increases",
y = "% of students")
plotC_w_8 <-Cohort_data %>%
filter(Product != "Any tobacco product", User == "Current") %>%
ggplot(aes(x=year,y=Mean, color=Product)) +
geom_line(aes(linetype=User)) +
geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1) +
scale_linetype_manual(values = c(2,1)) +
scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
scale_color_manual(values = v_colors) +
theme_linedraw() +
theme(legend.position = "none",
axis.title.x = element_blank()) +
labs(title = "E-cigarette use surpasses use of other products",
y = "% of students")
title_w_8 <- ggdraw() +
draw_label(
expression("Among students expected to be in the graduating class of 2019, how are vaping rates related to tobacco use?"),
fontface = 'bold',
size=14,
x = 0,
hjust = 0
) +
theme(
plot.margin = margin(0, 0, 0, 0)
)
plotsA_w_8 <- plot_grid(plotA_w_8,
rel_widths = c(1),
align = "v",
axis = "bt")
plotsBC_w_8 <- plot_grid(plotB_w_8,
plotC_w_8,
rel_widths = c(1,1),
axis = "bt")
legend_w_8 <- get_legend(plot1c +
theme(legend.position = "bottom",
legend.direction = "horizontal"))
figure_w_8 <- plot_grid(title_w_8,
plotsA_w_8,
plotsBC_w_8,
legend_w_8,
ncol = 1,
rel_heights = c(0.1,
1,
1,
0.1),
scale = 1.0
)
figure_w_8

---
title: "Open Case Studies : Vaping Behaviors in American Youth"
author: "Michael Ontiveros, Carrie Wright, PhD. "

css: style.css
output:
  html_document:
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes

---
<style>
#TOC {
  background: url("https://opencasestudies.github.io/img/logo.jpg");
  background-size: contain;
  padding-top: 240px !important;
  background-repeat: no-repeat;
}
</style>

```{r, echo=FALSE}
knit_time_start <- Sys.time()
```

```{r, echo=FALSE}
knitr::opts_chunk$set(fig.width=10, fig.height=8, dpi=300) 
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
                      message = FALSE, warning = FALSE, cache = FALSE,
                      fig.align = "center", out.width = '90%')
library(here)
library(knitr)
```


#### {.outline }
```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "final_plot.png"))
```
####




## {.disclaimer_block}

**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts. 

## **Motivation**
*** 
According to a recent [report](https://www.cdc.gov/mmwr/volumes/68/wr/mm6806e1.htm?s_cid=mm6806e1_w){target="_blank"}, overall tobacco use **increased** in youths (middle school and high school students) in the United States in 2017 and 2018, despite previous years of declining use.

This major increase is attributed to an increase in the use of electronic cigarette (e-cigarette) products.

E-cigarettes are referred to by many different names, including but not limited to:

1) Electronic nicotine delivery systems (ENDS)
2) Vapes
3) e-hookahs
4) vape pens
5) tanks
6) mods

The devices vary greatly:

```{r, echo = FALSE, fig.align ="center"}

include_graphics("https://www.lung.org/getmedia/8ac8ab8c-e7fc-497b-8384-441615f50ff0/ecigs_K.jpg.jpg")
```

##### [[source](https://www.lung.org/quit-smoking/e-cigarettes-vaping/lung-health)]

See this [CDC guide](https://www.cdc.gov/tobacco/basic_information/e-cigarettes/pdfs/ecigarette-or-vaping-products-visual-dictionary-508.pdf){target="_blank"} and the [American Lung Association website](https://www.lung.org/quit-smoking/e-cigarettes-vaping/lung-health){target="_blank"} for more information. 

The report found that:

> During 2017–2018, current use of any tobacco product **increased 38.3%** (from 19.6% to 27.1%) among high school students and **28.6%** (from 5.6% to 7.2%) among middle school students; e-cigarette use **increased 77.8%** (from 11.7% to 20.8%) among high school students and **48.5%** (from 3.3% to 4.9%) among middle school students.


In 2018, the [Federal Drug Administration (FDA) in the United States](https://acsjournals.onlinelibrary.wiley.com/doi/full/10.1002/cncr.31868){target="_blank"} stated that e-cigarette usage use among youth reached:  

> “nothing short of an **epidemic proportion of growth**”


In this case study, we will be investigating the same data used in the report that generated the above findings. This data comes from the [The National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"}.

#### {.reference_block}

Gentzke, Andrea S., Melisa Creamer, Karen A. Cullen, Bridget K. Ambrose, Gordon Willis, Ahmed Jamal, and Brian A. King.  “Vital Signs: Tobacco Product Use Among Middle and High School Students - United States, 2011-2018.” **MMWR. Morbidity and Mortality Weekly Report** 68 (6): 157–64 (2019).

####


## **Main Questions**
*** 

#### {.main_question_block}
<b><u> Our main question: </u></b>

1) How has tobacco and e-cigarette/vaping use by American youths changed since 2015?
2) How do vaping rates compare between males and females?
3) What vaping brands and flavors appear to be used the most frequently?  
We will base this on the following survey questions:   
> "During the past 30 days, what brand of e-cigarettes did you usually use?"   
>" What flavors of tobacco products have you used in the past
30 days?" 

4) Is there a relationship between e-cigarette/vaping use and other tobacco use?

####

AVOCADO: Note that I did change question 4 above to be less causal sounding

## **Learning Objectives** 
*** 

In this case study, we will cover how to import data from multiple files efficiently, how to import data from excel files, and how to make a variety of visualizations to compare multiple groups across time. We will also demonstrate how to work with codebooks. We will cover the concept of survey weighting and introduce the `srvyr` package. We will discuss the difference between pooled cross-sectional data and panel data. We will especially focus on using packages and functions from the [`Tidyverse`](https://www.tidyverse.org/){target="_blank"} for wrangling data, such as `tidyr` and `dplyr` and for visualization, such as as `ggplot2`. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.


AVOCADO: I would like these to be more detailed and probably separated into two categories along the lines of data science skills and statistical methods, so that the data import and visualization are in one category and the survey weighting is in another category. This is to help instructors who may be looking for an illustration of a particular tool or method to more readily identify an appropriate case study. More like keywords..... An attempt is below.

Data science skills:

1. Import data from Excel files
2. Merge data from multiple similar but not identical data structures
3. Create effective longitudinal data visualizations

Statistical methods:

1. Survey weighting


```{r, out.width = "20%", echo = FALSE, fig.align ="center"}

include_graphics("https://tidyverse.tidyverse.org/logo.png")
```


*** 


We will begin by loading the packages that we will need:

```{r}
library(here)
library(readxl)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
library(scales)
library(viridis)
library(forcats)
library(naniar)
library(srvyr)
library(cowplot)
```

 Package   | Use                                                                         
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data  
[readxl](https://readxl.tidyverse.org/){target="_blank"}      | to import the data in the excel files 
[magrittr](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"} | to use the compound assignment pipe operator `%<>%`
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the character strings within the data  
[purrr](https://purrr.tidyverse.org/){target="_blank"}   | to import the data in all the different excel and csv files efficiently
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to arrange/filter/select/compare specific subsets of the data  
[readr](https://readr.tidyverse.org/){target="_blank"}      | to import the CSV file data
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to rearrange data in wide and long formats 
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}    | to make visualizations with multiple layers
[scales](https://cran.r-project.org/web/packages/scales/scales.pdf){target="_blank"}    | to allow us to look at the colors within the viridis package
[viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html){target="_blank"}    | to make plots with a color palette that is compatible with color blindness
[forcats](https://forcats.tidyverse.org/){target="_blank"}    | to allow for reordering of factors in plots
[naniar](https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html){target="_blank"}  | to make a visualization of missing data
[syrvr](https://cran.r-project.org/web/packages/srvyr/srvyr.pdf){target="_blank"} | to use survey weights
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined 


The first time we use a function, we will use the `::` to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.


## **Context**
*** 

According to the cited [Morbidity and Mortality Weekly Report](https://www.cdc.gov/mmwr/volumes/68/wr/mm6806e1.htm?s_cid=mm6806e1_w) this was what was already known about this topic and the implications of this study:

```{r, echo = FALSE, fig.align ="center", out.width = "800 px"}

knitr::include_graphics(here::here("img", "context.png"))

```


Importantly, the vapors used in e-cigarettes contain harmful chemicals:

```{r, echo = FALSE, fig.align ="center"}

include_graphics("https://www.cdc.gov/tobacco/basic_information/e-cigarettes/images/e-cigarette-aerosol-can-contain-harmful-ingredients-desktop-700.jpg")
```

#### [[source](https://www.cdc.gov/tobacco/basic_information/e-cigarettes/images/e-cigarette-aerosol-can-contain-harmful-ingredients-desktop-700.jpg)]{target="_blank"} 

E-cigarette usage has also been associated with [lung injury]((https://www.frontiersin.org/articles/10.3389/fphar.2019.01619/full)){target="_blank"}:

```{r, echo = FALSE, fig.align ="center", out.width = "800 px"}

knitr::include_graphics(here::here("img", "lung.png"))
```

#### [[source](https://www.frontiersin.org/articles/10.3389/fphar.2019.01619/full)]{target="_blank"} 

See [here](https://www.cdc.gov/tobacco/basic_information/e-cigarettes/Quick-Facts-on-the-Risks-of-E-cigarettes-for-Kids-Teens-and-Young-Adults.html){target="_blank"} for additional information about the potential health effects of e-cigarettes in teens and young adults.

## **Limitations**
*** 
There are some important considerations regarding this data analysis to keep in mind: 

1) The [National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"} does not follow the same individual students over time.  A [longitudinal study](https://www.bmj.com/about-bmj/resources-readers/publications/epidemiology-uninitiated/7-longitudinal-studies){target="_blank"} that does follow the same individuals over time collects data called [panel data](https://en.wikipedia.org/wiki/Panel_data). The data in this study is called pooled [cross-sectional data](https://en.wikipedia.org/wiki/Cross-sectional_data), and is obtained from random collection of observations across time.

According to Wikipedia:

> Panel data differs from pooled cross-sectional data across time, because it deals with the observations on the same subjects in different times whereas the latter observes different subjects in different time periods

2) The data include percentages of students reporting use of each particular tobacco product, but the survey questions did not ask the relative amount of use of one product compared to another. For example the survey included questions like:"What flavors of tobacco products have you used in the past 30 days?" but did not ask how often one flavor was used by the same individual over another.

While [gender](https://www.genderspectrum.org/quick-links/understanding-gender/){target="_blank"} and [sex](https://www.who.int/genomics/gender/en/index1.html){target="_blank"} are not actually binary, the data used in this analysis only contain information for groups of individuals who answered the survey questions as male or female. 

## **What are the data?**
*** 
 
The data in this case study comes from the [National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"} which is an annual survey that asks students  in high school and middle school (grades 6-12) about tobacco usage in the United States of America.

The data for this survey is freely available online at this [website](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/data/index.html){target="_blank"} with data from 1999, 2000, 2002, 2004, 2006, 2009,  and 2011-2019. We will be using data from **2015-2019** due to the fact that these years are the most recent that asked questions regarding e-cigarette usage.

Each year includes documentation, such as a [codebook](https://www.lib.ncsu.edu/data/icpsrfaq#whatare){target="_blank"} and an excel file containing the data:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "data.png"))
```
Therefore, since we are using data from **2015-2019**, the data we are interested in are located in 5 different excel files, one for each year, each with its own [codebook](./docs/2019-nyts-dataset-and-codebook-microsoft-excel/2019-nyts-codebook-p.pdf){target="_blank"} (this is the one for 2019).

The codebook contains information describing the data within the excel file. 

As you can see the excel file contains very short variable names and numeric values, and it is not clear what they mean without the codebook:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "excel.png"))
```

The codebook explains what the variables (the columns) are:
```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "variables.png"))
```

And the codebook explains what the values for each variable are:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "qn1.png"))
```

We will explain more later about what the values on the right indicate.

The reason that there are codebooks for each year is because the questions asked each year varied slightly.


The data in this survey are pooled cross-sectional data. In this study design, different subsets of students are surveyed each year and it is not clear which, if any, individuals participate more than once from one year to the next.

## **Data Import**
*** 

### Reading in the excel files
***
Since these excel files are so large (each has roughly 20,000 rows), it takes a bit of time for the data to load. To make the process faster, we previously imported these files, selected only the columns relevant to our questions of interest, and saved these data subsets as comma-separated (.csv) files. 

<details><summary> Click here for details on how the data were originally imported </summary>

First we created a vector of file names of all the different excel files. Using the `here()` function of the `here` package, we looked in all the directories of the project.
The `list.files()` function looked for all files with .xlsx within these sub-directories.
```{r}
excel_files<-list.files(here::here(), recursive = TRUE,
                  pattern = "*.xlsx")
excel_files
```

All the files were read using `read_excel()` of the `readxl` package. Using the `map()` function of the `purrr` package this was done efficiently for all of the excel files in the vector using one command. The `.` is used to indicate that we want to apply the `read_excel()` function to each element of the data that we just piped into the `map()` function.

Here we also used the `%>%` pipe which can be used to pass the input from one function to another for later sequential steps. 

This created a single list of tibbles (one for each file). 
```{r, eval = FALSE}
tbl_files <- excel_files %>% 
       map(~readxl::read_excel(.))
```

The elements of this list are in the same order as the elements of the `excel_files` vector, so we can extract the data for a particular file (year) by selecting a certain element of the list. However, it is safer to be able to select the data from this list for a specific year using a name based on the original vector of file names. We extracted a name from each Excel file name using the `str_extract()` function of the `stringr` package. Here we are keeping occurrences of the character string "nyts201" followed by a "5","6","7","8", or "9".

```{r}
tbl_names <- excel_files %>%
  str_extract("nyts201[5-9]")

tbl_names
```

These names became the names of the tibbles in the list of tibbles.
```{r, eval = FALSE}
names(tbl_files) <- tbl_names
```


Specific columns were selected using the `select()` function of `dplyr` from each of the tibbles using the variable name, as identified in the codebook as being of interest for our analysis.
In some cases functions like `starts_with()` of the `dplyr` package were used to select several variables at once. Most of the survey questions about tobacco use start with an `"E"` or a `"C"` according to the codebooks. 

```{r, eval = FALSE}

tbl_files[["nyts2015"]] <- tbl_files[["nyts2015"]] %>%
    dplyr::select(psu, #Primary Sampling Unit
                  finwgt,#Analysis Weight
                  stratum,#Sampling stratum
                  Qn1, #Age
                  Qn2, #Sex
                  Qn3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  )


tbl_files[["nyts2016"]] <- tbl_files[["nyts2016"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  ) 

tbl_files[["nyts2017"]] <- tbl_files[["nyts2017"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  )

tbl_files[["nyts2018"]] <- tbl_files[["nyts2018"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, #Clove or spice
                  Q50C, #Fruit
                  Q50D, #Chocolate
                  Q50E, #Alcoholic Drink
                  Q50F, #Candy/Desserts/Other Sweets
                  Q50G, #Some Other Flavor Not Listed Here
                  Q50H #I Did Not Use Flavored Tobacco Products In the Past
                  )

tbl_files[["nyts2019"]] <- tbl_files[["nyts2019"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, #Age
                  Q2, #Sex
                  Q3, #Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q40, #Brand, e-cigarettes
                  Q62A, #Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q62B, #Clove or spice
                  Q62C, #Fruit 
                  Q62D, #Chocolate
                  Q62E, #Alcoholic Drink
                  Q62F, #Candy/Desserts/Other Sweets
                  Q62G, #Some Other Flavor Not Listed Here 
                  )

```

A directory called `data_reduced` was created for the new .csv files using the base `dir.create()` function. 
A csv file was created for each of the tibbles in the list using the `write_csv()` function of the `readr` package.
This was done all at once using the `map()` function.

AVOCADO: I got it to work using map; hopefully this is OK?

```{r, eval = FALSE}
# 
dir.create("docs/data_reduced")

names(tbl_files) %>% map(~write_csv(tbl_files[[.]], path=paste0("docs/data_reduced/", ., ".csv")))   

```

</details>



Now we will show how to read in the data from the five csv files that were created from the five different Excel files.

### Reading in the CSV files
***


```{r}
nyts_data <- list.files(here::here(),recursive = TRUE,
                   pattern = "*.csv")%>%
  map(~read_csv(.))


nyts_data_names <- list.files(recursive = TRUE,
                         pattern = "*.csv")%>%
  str_extract("nyts201[5-9]")

names(nyts_data) <- nyts_data_names
```


## **Data Exploration and Wrangling**
*** 

Our goal in this section is to adjust or "wrangle" the data from each year into a common format so that we can combine the data sets across years for our analysis, and so that we have values in our variables that are correct and easy to interpret. We will need to understand what is the same and what is different across the data from different years, rename and recode the variables (e.g., by replacing the numbers 1 and 2 with the values "Male" and "Female" for the `Sex` variable), and combine the data. We will walk through these steps below. 

First, let's take a look at our data. We can get a good sense of it using the `glimpse()` function of the `dplyr` package.

#### {.scrollable }

```{r}
dplyr::glimpse(nyts_data[["nyts2015"]])
```
####

### Updating the set of variables and their names
*** 

The easiest way of making it so that the data from the different years can be combined is by making sure the different data sets all contain the same variables that share the same names. In addition, giving the columns informative names will help make our code more readable. Currently, it isn't very clear what most of the variables indicate since the variable names are uninformative on their own, without the [codebook](./docs/2019-nyts-dataset-and-codebook-microsoft-excel/2019-nyts-codebook-p.pdf){target="_blank"}.

We want to rename variables like `Qn1` to something more meaningful like `Age`.

To do this we will use the `rename()` function of the `dplyr` package. The new name is always listed first before the `=`. This function will replace the old variable names with the new ones, i.e., after running the code below, there will no longer be a `Qn1` variable in the data set, but there will be an `Age` variable instead. We will start working with the 2015 data, and then move on to the other years down below.

```{r}

nyts_data[["nyts2015"]] <- nyts_data[["nyts2015"]] %>%
    dplyr::rename(Age=Qn1,
           Sex=Qn2,
           Grade=Qn3)
```


AVOCADO: I later learned that it is not correct that we need the same variables in the data set when using bind_rows, so why are we creating these dummy variables with NAs here? Won't they be automatically created later in any case?

We also need to add new variables about usage of brands and specific flavors of e-cigarettes, because although later years have questions about this, this year unfortunately only has broad questions about flavors. Eventually we want to put the data from each year together, so we need the same variables for each year. Thus we need to add variables for brand and flavor to the 2015 data set that just contain missing values (`NA`).

To create these new variables, we will use the `mutate` function of the `dplyr` package. The `mutate` function can also be used to change the existing variables and create new variables from the old ones.

```{r}
nyts_data[["nyts2015"]] <- nyts_data[["nyts2015"]] %>%
  dplyr::mutate(
           brand_ecig=NA,
           menthol=NA,
           clove_spice=NA,
           fruit=NA,
           chocolate=NA,
           alcoholic_drink=NA,
           candy_dessert_sweets=NA,
           other=NA,
           no_use=NA) 
```


Now we can see how the data has changed:

#### {.scrollable }
```{r}
glimpse(nyts_data[["nyts2015"]])
```
####


The data for 2016-2018 have many common attributes, so we will want to write code that can be applied to all three data sets. To do this, we will use a `function` in R, which is basically a piece of code that can be applied to similar but different objects in R (e.g., the data tibbles from each of these three years). For more information on functions, see for example [here](https://r4ds.had.co.nz/functions.html){target="_blank"}.

AVOCADO: I added a link to the r4ds chapter on functions; are there other places in the case studies that we discuss functions? Or additional links or resources you want to add?


These next 3 years have the same structure for many of the questions we are interested in. For example, they all have flavor questions, but not a brand question. Moreover, their variable names are consistent across the years; for each year, we want to replace the vague question name `Q50A` with the value `menthol` in all three data sets, and the same is true for the other flavor variables.  For each of these years, we also want to create a variable `brand_ecig` that just contains `NA` values.  

Since we want to perform the same modifications on the data from all three years, rather than repeating the same somewhat messy piece of code three times, we can do this more efficiently if we create a function to do all of these steps at once. Then we can use the `map_at()` function of the `purrr` package to apply the function we created (for renaming variables etc.) to the data from 2016-2018. By using `vars()` inside of the `map_at()` function we can specify what tibbles within our `nyts_data` list we want to include or exclude.

```{r}

Update_survey <- function(x) { x %>%
      rename(Age=Q1,
           Sex=Q2,
           Grade=Q3,
           menthol=Q50A,
           clove_spice=Q50B,
           fruit=Q50C,
           chocolate=Q50D,
           alcoholic_drink=Q50E,
           candy_dessert_sweets=Q50F,
           other=Q50G,
           no_use=Q50H) %>%
    mutate(brand_ecig=NA)
}

#few options to apply to the data:
#nyts_data <-nyts_data %>% map_at(vars(nyts2016, nyts2017, nyts2018), Update_survey)
#nyts_data <-nyts_data %>% map_at(c("nyts2016", "nyts2017", "nyts2018"), Update_survey)
nyts_data <-nyts_data %>% map_at(vars(-nyts2015, -nyts2019), Update_survey)
```

The final year, 2019, has a slightly different data structure compared to these earlier data sets. For example, it actually has a `brand_ecig` variable already, but does not contain a `no_use` variable. So we can't use the same function for this data set, but have to write an individual code chunk.

```{r}
nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
    rename(brand_ecig=Q40,
           Age=Q1,
           Sex=Q2,
           Grade=Q3,
           menthol=Q62A,
           clove_spice=Q62B,
           fruit=Q62C,
           chocolate=Q62D,
           alcoholic_drink=Q62E,
           candy_dessert_sweets=Q62F,
           other=Q62G) %>%
    mutate(no_use="missing")
```


Now let's take a look at the variable names for each of the years using the `map` function from `purrr`.

#### {.scrollable }

```{r}
map(nyts_data, names)
```
####

It's looking better! There are still some differences in the set of variables in the different years, but things are much closer.

### Updating Values

Now that we have made some progress on the selection and names of the variables themselves, we will work on the values contained in the different variables.

We can start with updating the values for `Age` and `Grade`, so that they are more understandable. 

Recall from the codebook for this year's data set that `Age` isn't listed in the way one might expect, i.e., it is not just a number of years, but a numerically valued categorical variable.

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "qn1.png"))
```

The same is true for `Grade`:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "grade.png"))
```

**This is why it is so important to always check the codebook!!**

We also  want to replace the value of `19` for `Age` to be `">18"` and the value of `13` for `Grade` to be replaced with `"Ungraded/Other"` Also according to the codebooks, numeric values of `1` indicate a survey answer of `FALSE`, while a value of `2` indicates `TRUE`. `Sex` needs to be recoded, but it is a bit unclear if it is encoded slightly differently across years.

AVOCADO: For the Sex variable, how did you determine "It's a bit difficult to tell if the encoding was the same across years" -- I don't see documentation of this (i.e., is this missing from the codebook somehow?), so it might be helpful to let the reader know how you figured this out. Carrie says to look at the codebooks and it will make sense; could consider screen shots. I looked at it again; the confusion seems to come from the SEX:RECODE variables, but we aren't using those. Otherwise, I do think this could be dropped.


Let's create a function to make all these updates except for `Sex`. We will use the `mutate` function again, as well as `mutate_all` and `recode` to replace specific values of certain variables.

```{r}
Update_values <- function(x) { x %>%
      mutate(Age=as.numeric(Age)+8,
             Grade=as.numeric(Grade)+5)%>%
      mutate(Age=as.factor(Age),
         Grade=as.factor(Grade),
         ) %>%
  mutate_all(~ replace(., . %in% c("*", "**"), NA))%>%
  mutate(Age=recode(Age,
                    `19` = ">18",
                    ),
         Grade=recode(Grade,
                      `13` = "Ungraded/Other")) %>%
  mutate_at(vars(starts_with("E", ignore.case = FALSE),
                   starts_with("C", ignore.case = FALSE)),
              list(~recode(.,
                           `1` = TRUE,
                           `2` = FALSE,
                           .default = NA,
                           .missing = NA)))
}

nyts_data <-nyts_data %>% map(.,Update_values)

```


Now let's update the `Sex` encoding:

It's a bit difficult to tell if the encoding was the same across years, so let's check using the `count()` function of the `dplyr` package.

According to the codebook, we should have:  
1) 8,958 males in 2015 
2) 10,438 males in 2016 
3) 8,881 males in 2017  
4) 10,069 males in 2018  
5) 9,803 males in 2019  

```{r}
count_sex <-function(x){x %>%count(Sex)}
nyts_data %>% map(., count_sex)
```


Thus, it looks like males were encoded as `1` for each year.

1) 8,958 males in 2015 where 1 = male  
2) 10,438 males in 2016 where 1 = male
3) 8,881 males in 2017 where 1 = male  
4) 10,069 males in 2018 where 1 = male
5) 9,803 males in 2019 where 1 = male


```{r}

  Update_sex <- function(x){x %>%
      mutate(Sex = as.factor(Sex)) %>%
         mutate(Sex = recode(Sex,
                      `1`= "male",
                      `2` = "female"))
  }

nyts_data <-nyts_data %>% map(., Update_sex)
count_sex <-function(x){x %>%count(Sex)}
nyts_data %>% map(., count_sex)

```

Looks good!



The years (2016-2019) that have flavors also need the flavor data to be logical (meaning TRUE or FALSE):

```{r}
Update_flavors <- function(x){x %>%
   mutate_at(vars(menthol:no_use),
              list(~recode(.,
                           `1` = TRUE,
                       .default = FALSE,
                      .missing = FALSE))) }

nyts_data  <- nyts_data  %>% map_at(vars(-nyts2015), Update_flavors)

```


Now there are just a few changes needed that are specific to 2019. Specifically, some of the 2019 questions use the values ".N", ".S", and ".Z" to indicate different types of missing data (see for example Q2 of the 2019 [codebook](./docs/2019-nyts-dataset-and-codebook-microsoft-excel/2019-nyts-codebook-p.pdf){target="_blank"}); we just want them to be replaced with `NA` values.


```{r}
nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
  mutate_all(~ replace(., . %in% c(".N",".S",".Z"), NA)) %>%
  mutate(psu=as.character(psu)) %>%
  mutate(brand_ecig = recode(brand_ecig,
                                             `1` = "Other", #levels 1,8 combined to `Other` 
                                             `2` = "Blu",
                                             `3` = "JUUL",
                                             `4` = "Logic",
                                             `5` = "MarkTen",
                                             `6` = "NJOY",
                                             `7` = "Vuse",
                                             `8` = "Other"))

```

Great! Now our values don't need to be handled any differently for any of the years, thus we can combine the data across years.

Even though we have different numbers of variables for each year, we can coerce the data to be combined into one tibble by using the `bind_rows()` function of `dplyr`. Importantly, this function does not require that the columns be the same.  This will create NA values for any variable that is not present in given data frame but is  present in one of the other data frames that is being combined. Note that the `bind_cols()` function does expect that the rows match. The `.id` argument will create a new variable with values to link each row to its original data frame. For more information see [here](https://dplyr.tidyverse.org/reference/bind.html).


```{r}
nyts_data <- nyts_data %>%
  map_df(bind_rows, .id = "year")

glimpse(nyts_data)
```

We will want to do some of our analysis split by year, so we would like to be sure we have one variable that has the correct value for year. It looks like we just need to remove `"nyts"` from the year variable that we created from the names of the tibbles in our list and we should be all set. We will use another function from the `stringr` package to do this. The `str_remove()` function takes a string followed by a pattern and removes the pattern from the string.

```{r}
nyts_data <- nyts_data %>%
  mutate(year=as.numeric(str_remove(year,"nyts")))

```

Here is our clean and wrangled data:

#### {.scrollable}
```{r}
glimpse(nyts_data)

```

####


Note that there are several variables where there are similar names, but with a `C` compared to an `E` in the variable name. Those starting with `C` are related to questions about current usage (last 30 days), while those with an `E` are related to usage across the subjects whole life ("ever" usage). We will discuss these groups further below.

AVOCADO: I'm not sure where this should go, but I had the thought so I figured I'd include it. Maybe we can be more consistent about how we refer to students, subjects, respondents? Carrie suggests maybe trying to use students where we are talking about this specific data set, but respondents in the general survey weighting section.



### Variable Table
<details><summary> Click here to see a table about the final variables in our data set. </summary>

value 1 = yes, value 2 = no

Variable   | Details                                                                        
---------- |-------------
**year**  | the year that the survey results from a particular student were acquired  
**psu** | the primary sampling unit for the survey weighting  
**finwgt** | the final analysis weight for the survey weighting  
**stratum** | the stratum used for variance estimation for the survey weighting  
**Age** | the age of the student when they took the survey  
**Sex** | the sex of the student when they took the survey  
**Grade** | the grade of the the student when the took the survey  
**ECIGT** | student reported having ever tried cigarette smoking, even one or two puffs  
**ECIGAR** | student reported having ever tried cigar, cigarillo, or little cigar smoking, even one or two puffs  
**ESLT** | student reported having ever tried chewing tobacco, snuff, or dip  
**EELCIGT** | student reported having ever tried e-cigarettes  
**EROLLCIGTS** | student reported having ever tried roll-your-own cigarettes  
**EFLAVCIGTS** |  (2015 only) based on answer to "Which of the following tobacco products that you used in the past 30 days were flavored?"  
**EBIDIS** | student reported having ever tried bidis (small brown cigarettes wrapped in a leaf)  
**EFLAVCIGAR** | student reported having ever tried a flavored cigar (2015-2016)
**EHOOKAH** | student reported having ever smoked tobacco from a hookah or a waterpipe  
**EPIPE** | student reported having ever smoked tobacco from a pipe (not hookah)  
**ESNUS** | student reported having ever used snus, such as Camel or Malboro Snus  
**EDISSOLV** | student reported having ever tried dissolvable tobacco products such as Ariva, Stonewall, Camel orbs, Camel sticks, Marlboro sticks, or Camel strips  
**CCIGT** | student reported they smoked cigarettes on >= 1 of the past 30 days  
**CCIGAR** | student reported they smoked cigars on >= 1 of the past 30 days  
**CSLT** | student reported they used chewing tobacco, snuff, or dip on >= 1 of the past 30 days  
**CELCIGT** | student reported they used electronic cigarettes or e-cigarettes one or more days in the past 30
**CROLLCIGTS** | student reported they smoked roll-your-own cigarettes during the past 30 days  
**CFLAVCIGTS**| (2015 only) based on answer to "Which of the following tobacco products that you used in the past 30 days were flavored?" 
**CBIDIS** | student reported they smoked bidis during the past 30 days  
**CHOOKAH** | student reported they smoked tobacco in a hookah on >= 1 of the past 30 days  
**CPIPE** | student reported they smoked tobacco in a pipe (not hookah) during the past 30 days  
**CSNUS** | student reported they used snus during the past 30 days    
**CDISSOLV** | student reported they used dissolvable tobacco products such as Ariva, Stonewall, Camel orbs, Camel sticks, Marlboro sticks, or Camel strips during the past 30 days  
**brand_ecig** | student answer to "During the past 30 days, what brand of e-cigarettes did you usually use?"  
**menthol** | student selected Menthol or mint as the answer to "What flavors of tobacco products have you used in the past 30 days? (select one or more)"  
**clove_spice** | student selected clove or spice as the answer to "What flavors of tobacco products have you used in the past 30 days? (select one or more)"  
**fruit** | student selected fruit as the answer to "What flavors of tobacco products have you used in the past 30 days? (select one or more)"  
**chocolate** | student selected chocolate as the answer to "What flavors of tobacco products have you used in the past 30 days? (select one or more)"  
**alcoholic_drink** | student selected alcoholic drink (such as wine, cognac, margarita, or other cocktails) as the answer to "What flavors of tobacco products have you used in the past 30 days? (choose one or more)"  
**candy_dessert_sweets** |  student selected candy, desserts or other sweets as the answer to "What flavors of tobacco products have you used in the past 30 days? (choose one or more)" 
**other** | student selected some other flavor not listed as the answer to "What flavors of tobacco products have you used in the past 30 days? (choose one or more)" 
**no_use** | student reported that they did not use any flavored tobacco products in the past 30 days or did not answer the question about using flavors
**EHTP** | student reported having ever tried heated (also known as "heat-not-burn") tobacco products  
**CHTP** | student reported they used heated tobacco products during the past 30 days  
</details>

## **Data Visualization**
*** 

Recall that our main questions were:

1) How has tobacco and e-cigarette use by American youths changed since 2015?
2) How do vaping rates compare between males and females?
3) What vaping brands and flavors appear to be used the most frequently?  
We will base this on the following survey questions:   
> "During the past 30 days, what brand of e-cigarettes did you usually use?"   
>" What flavors of tobacco products have you used in the past
30 days?" 

4) Is there a relationship between e-cigarette/vaping use and other tobacco use?


We are now going to create data visualizations to explore each of these questions.

For many of these questions we will be interested in both **current** and **ever** users, so we will want to create a variable for labeling individuals who are current users of any tobacco product (or not, i.e., who do not currently use a tobacco product) and a variable for labeling individuals who are "ever users" of any tobacco product (or not, i.e., who have never used a tobacco product).

We define these two groups as follows:

1) **current** = students who used a product for >=1 day in the past 30 days  
2) **ever** =  students who report having used or tried a product at any point in time

All **current** users are therefore **ever** users but not all **ever** users are **current** users. Thus, **current** users are a subset of **ever** users.

To add these grouping variables to our data we will do a bit more wrangling using the `mutate()` function again of the `dplyr` package. As discussed above, our data set contains a set of questions that relate to whether the student has ever used the particular tobacco product (questions that start with the letter "E"), and questions that relate to whether the student currently uses the particular tobacco product (questions that start with the letter "C"). 

Here are some examples for these data entries:  

 - EPIPE: Students who reported they have smoked tobacco from a pipe (not hookah).  
 - CPIPE: Students who reported they smoked tobacco in a pipe (not hookah) during the past 30 days. 
 - EROLLCIGTS: RECODE: Students who reported they have tried smoking roll-your-own cigarettes. 
 - CROLLCIGTS: RECODE: Students who reported they smoked roll-your-own cigarettes during the past 30 days. 
 
Based on many questions like this:  
 
 In the past 30 days, which of the following products have you used on at least one day? (Select one or more) 
A. Roll-your-own cigarettes  
B. Pipes filled with tobacco (not hookah or waterpipe)  
C. Snus, such as Camel, Marlboro, or General Snus  
D. Dissolvable tobacco products such as Ariva, Stonewall, Camel orbs, Camel sticks, Marlboro sticks,
or Camel strips  
E. Bidis (small brown cigarettes wrapped in a leaf)  
F. I have not used any of the products listed above in the past 30 days  

 Which of the following tobacco products have you ever tried, even just one time? (Select one or more)  
A. Roll-your-own cigarettes  
B. Pipes filled with tobacco (not hookah or waterpipe)  
C. Snus, such as Camel, Marlboro, or General Snus  
D. Dissolvable tobacco products such as Ariva, Stonewall, Camel orbs, Camel sticks, Marlboro sticks, or Camel strips  
E. Bidis (small brown cigarettes wrapped in a leaf)  
F. I have never tried any of the products listed above   
 
 
We will sum across the variables that relate to ever or current tobacco usage questions to determine if the student answered yes to any of the ever or current questions. 

We will then use the `case_when()` function of the `dplyr` package to convert the sum values to `TRUE` or `FALSE` based on the threshold of zero. If the sum is greater than zero, then we know the student answered yes to at least one question. 
 
```{r}
nyts_data <-nyts_data %>%
    mutate(tobacco_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               base::apply(1, sum, na.rm=TRUE),
           tobacco_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
                                    tobacco_sum_ever == 0 ~ FALSE),
           tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
                                    tobacco_sum_current == 0 ~ FALSE)) 

```


#### {.scrollable}
```{r}
glimpse(nyts_data)
```
####

We are also interested in e-cigarette usage compared to other tobacco products, so we will create some variables related to the sum of all e-cigarette usage question variables and the sum of all tobacco usage question variables excluding those that are about e-cigarettes. There is only one variable about e-cigarette usage ever (EELCIGT) and one about current usage (CELCIGT).

```{r}
   nyts_data <-nyts_data %>% mutate(ecig_sum_ever = select(., EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           ecig_sum_current = select(., CELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_ever = select(., starts_with("E", ignore.case = FALSE)) %>%
               select(.,-EELCIGT) %>%
               apply(1, sum, na.rm=TRUE),
           non_ecig_sum_current = select(., starts_with("C", ignore.case = FALSE)) %>%
               select(.,-CELCIGT) %>%
               apply(1, sum, na.rm=TRUE)) %>%
    mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                    ecig_sum_ever == 0 ~ FALSE),
           ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                    ecig_sum_current == 0 ~ FALSE),
           non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                    non_ecig_sum_ever == 0 ~ FALSE),
           non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                    non_ecig_sum_current == 0 ~ FALSE))
```



#### {.scrollable}
```{r}
glimpse(nyts_data)
```

####

Finally, we are also interested in grouping students that only use e-cigarettes and those that only use other forms of tobacco.

Recall that current users are a subset of ever users, thus students would typically answer yes to having tried vaping products if they had used them one or more days in the past 30 days.

First we will make a small toy dataset called `test` to show what we will do with the larger dataset:
```{r}
test <- tibble(ecig_ever = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE"),
               non_ecig_ever = c("TRUE", "FALSE","FALSE","FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE"),
               ecig_current = c("TRUE", "FALSE","FALSE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE"),
               non_ecig_current = c("TRUE", "FALSE","TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "TRUE"))

test
```

Now, let's look at identifying students who have tried e-cigarettes, but are not current users, and who have never tried other tobacco products (and are therefore not current users). We will again use the `case_when()` and the `mutate` function to create new variables with specific values when certain conditions are met. In this case, we will specify that several conditions must be met by using the `&` symbol. For a value of `TRUE` for the new `ecig_only_ever` variable, all of the conditions combined with `&` must be met.  If any of the conditions are not met then the `ecig_only_ever` value will be `FALSE` based on the last line `TRUE ~ FALSE`.

```{r}

test <- test %>% mutate(ecig_only_ever = case_when(ecig_ever == TRUE & 
                                               non_ecig_ever == FALSE & 
                                                ecig_current == FALSE &
                                            non_ecig_current == FALSE ~ TRUE,
                                                         TRUE ~ FALSE))
test
```

We can see from the second row, that the `ecig_only_ever` is `TRUE` when we would expect it to be.
We can also see from the fourth row, that even though the student reported yes to ever trying e-cigarettes, because they also reported yes to currently using e-cigarettes the value for only ever trying e-cigarettes is `FALSE`. Additionally we can see from the seventh row that similarly even though the student reported yest to ever trying e-cigarettes, they also reported yes to ever trying other products, and the value for only ever trying e-cigarettes is `FALSE`. Importantly, we can see from the 6th row, that if all responses are negative than the value is `FALSE`.

Now we will expand this to the other possible categories. In this case we note that since current users are a subset of ever users, it doesn't matter if a user reports yes to ever trying  e-cigarettes, they can still be a current user.


```{r}
test <-test %>% mutate(ecig_only_ever = case_when(ecig_ever == TRUE & 
                                              non_ecig_ever == FALSE & 
                                               ecig_current == FALSE &
                                           non_ecig_current == FALSE ~ TRUE,
                                                       TRUE ~ FALSE),
                 ecig_only_current = case_when(ecig_current == TRUE &
                                              non_ecig_ever == FALSE &
                                           non_ecig_current == FALSE ~ TRUE,
                                                       TRUE ~ FALSE),
               non_ecig_only_ever = case_when(non_ecig_ever == TRUE &
                                                  ecig_ever == FALSE &
                                               ecig_current == FALSE &
                                           non_ecig_current == FALSE ~ TRUE, 
                                                        TRUE ~ FALSE),
         non_ecig_only_current = case_when(non_ecig_current == TRUE &
                                                  ecig_ever == FALSE &
                                               ecig_current == FALSE ~ TRUE,
                                                        TRUE ~ FALSE),
                           no_use = case_when(non_ecig_ever == FALSE &
                                                  ecig_ever == FALSE &
                                               ecig_current == FALSE &
                                           non_ecig_current == FALSE ~TRUE,
                                                        TRUE ~ FALSE)) 
glimpse(test)
```



Take a minute to check that the values are what we would expect.

OK, now we are going to make a `Group` variable based on the new variables we just made to classify students into one of four mutually exclusive and exhaustive categories. In this case we will have a particular value based on one condition **or** another. This **or** conditional is specified by the `|` symbol. Only one of the conditions needs to exist for that particular value, whereas when we used the `&` symbol, all of the conditions had to be met. If a student has ever tried or currently uses e-cigarettes, but has never tried other tobacco products, the value will be `Only e-cigarettes`. If a student has ever tried or is a current user of other tobacco products, but has never tried e-cigarettes, the value will be `Only other products`. If the value of the `no_use` variable is simply `TRUE`, then the `Group` variable value will be `Neither`. Finally, if a student has tried or currently uses both e-cigarettes and other tobacco products the `Group` variable value will be `Combination of products`. Therefore the values for the usage of the variables based on only using e-cigarettes or only other products will all be `FALSE`. Of course the student will also have had to answer yes to any of the use-based questions, thus the no_use variable would also need to be `FALSE`.

AVOCADO: Consider dropping the last two sentences in the paragraph above? i.e., from "Therefore the values..." to "would also need to be `FALSE`." I'm not sure what they are adding. LEAVE IN FOR CARRIE TO REVIEW.

```{R}

test <- test %>% 
  mutate(Group = case_when(ecig_only_ever == TRUE |
                        ecig_only_current == TRUE ~ "Only e-cigarettes",
                       non_ecig_only_ever == TRUE |
                    non_ecig_only_current == TRUE ~ "Only other products",
                                   no_use == TRUE ~ "Neither",
                           ecig_only_ever == FALSE &
                        ecig_only_current == FALSE &
                       non_ecig_only_ever == FALSE &
                    non_ecig_only_current == FALSE &
                                   no_use == FALSE ~ "Combination of products"))


test %>% count(Group)

glimpse(test)
```

OK, now that we have seen how this works with our toy dataset, we will apply our code to our `nyts_data`.

```{r}
    nyts_data %<>%
                mutate(ecig_only_ever = case_when(ecig_ever == TRUE & 
                                              non_ecig_ever == FALSE & 
                                               ecig_current == FALSE &
                                           non_ecig_current == FALSE ~ TRUE,
                                                       TRUE ~ FALSE),
                 ecig_only_current = case_when(ecig_current == TRUE &
                                              non_ecig_ever == FALSE &
                                           non_ecig_current == FALSE ~ TRUE,
                                                       TRUE ~ FALSE),
               non_ecig_only_ever = case_when(non_ecig_ever == TRUE &
                                                  ecig_ever == FALSE &
                                               ecig_current == FALSE &
                                           non_ecig_current == FALSE ~ TRUE, 
                                                        TRUE ~ FALSE),
         non_ecig_only_current = case_when(non_ecig_current == TRUE &
                                                  ecig_ever == FALSE &
                                               ecig_current == FALSE ~ TRUE,
                                                        TRUE ~ FALSE),
                           no_use = case_when(non_ecig_ever == FALSE &
                                                  ecig_ever == FALSE &
                                               ecig_current == FALSE &
                                           non_ecig_current == FALSE ~TRUE,
                                                        TRUE ~ FALSE)) %>%
                    mutate(Group = case_when(ecig_only_ever == TRUE |
                                          ecig_only_current == TRUE ~ "Only e-cigarettes",
                                         non_ecig_only_ever == TRUE |
                                      non_ecig_only_current == TRUE ~ "Only other products",
                                                     no_use == TRUE ~ "Neither",
                                             ecig_only_ever == FALSE &
                                          ecig_only_current == FALSE &
                                         non_ecig_only_ever == FALSE &
                                      non_ecig_only_current == FALSE &
                                                     no_use == FALSE ~ "Combination of products"))
```


Lastly, it can be very helpful to have the total number of students surveyed each year. We can easily add a variable for this by using the `add_count()` function of the `dplyr` package. This will create a variable called `n` which will show the total number of survey responses for that year.

```{r}
nyts_data %<>% add_count(year)

```

#### {.scrollable}
```{r}
glimpse(nyts_data)
```

### Question 1 

Recall that we are interested in investigating how vaping product use has compared with other tobacco use over time. To answer this, we first want to get a sense of how tobacco use has changed in general since 2015. 

To create a visualization of how tobacco usage has changed over time, we will first convert the usage data to a percent value for each year, telling us what percent of respondents fall into a particular usage category. To do this we will use the `group_by()` and `summarize()` functions of the `dplyr` package. This will create new variables which we will name `Ever` and `Current` based on the  percentages of `TRUE` values for `tobacco_ever` and `tobacco_current` for each year. In this case the `mean()` function is used to calculate the percentages based on an automatic conversion that R does where for TRUE/FALSE variables, `TRUE` is given a value of one and `FALSE` is given a value of zero. The mean of a 0-1 binary variable is just the percent of the time the value is 1. NA values do not contribute to the total count when we include the argument `na.rm = TRUE` to our function call. 

<details> <summary> Click here to see a toy example:</summary>
```{r}
# the test data has 3 TRUE values and 7 FALSE values
test <-tibble("var1"= c("TRUE", "TRUE", "TRUE", rep("FALSE", 7)))
test%<>% mutate(var1 = as.logical(var1))
test

test %>% summarize(Percentage = mean(var1)*100)


#the test data has 3 TRUE values, 3 FALSE values, and 4 NA value
test <-tibble("var1"= c("TRUE", "TRUE", "TRUE", rep("FALSE", 3), rep("NA", 4)))
test%<>% mutate(var1 = as.logical(var1))
test

test %>% summarize(Percentage = mean(var1, na.rm = TRUE)*100)
```
</details>

And now back to our data:

```{r}

nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize(Ever = (mean(tobacco_ever, na.rm = TRUE)*100),
                  Current = (mean(tobacco_current, na.rm = TRUE)*100))

```

We will use the `pivot_longer` function to take all columns except year (in this case the `Ever` and `Current` columns), to create a column called `User` that will contain the current column name information and a column called `Percentage of students` which will contain the mean percentage values that we just calculated. This converts our data into a format called "long" format.

```{r}

nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize(Ever=(mean(tobacco_ever, na.rm = TRUE)*100),
                     Current=(mean(tobacco_current, na.rm = TRUE)*100))%>%
   pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students")

```
You may have noticed that our data is longer than it used to be! Hence the name of the function `pivot_longer()`. Data is often easier to plot when it is in this format.
  
Now we will use this data to create a plot using the `ggplot2` package. 

The first thing we do to create a plot is specify what data we are using for our x axis and y axis with the`aes()` argument of the `ggplot()` function. Then we add layers to our plot that specify what type of plot we would like to create. We can use the `geom_line()` function to create lines for each type of user. We specify that we want to use different line types for each user using `aes()`. We will also add points to our lines using the `geom_point()` function. We can add additional layers to specify colors and details about labels and legends etc.
  
```{r}  
  
plot1 <- nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize(Ever=(mean(tobacco_ever, na.rm = TRUE)*100),
                     Current=(mean(tobacco_current, na.rm = TRUE)*100))%>%
  pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students")%>%
    ggplot(aes(x=year,y=`Percentage of students`)) +
    geom_line(aes(linetype=User)) + 
  geom_point(show.legend = FALSE, size = 2) +
  # this allows us to choose what type of line we want for each line
  scale_linetype_manual(values = c(2,1)) +
  # this allows us to specify how the y-axis should appear
  scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
  #this adjusts the background style of the plot
    theme_linedraw() +
  # this moves the legend to the bottom of the plot and removes the x axis title
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How has tobacco use varied over the years?",
         y = "% of students")

plot1 
```

Nice! Now we can see how overall tobacco usage has changed since 2017. It appears that usage first decreased from 2015 to 2017 and then increased a bit since 2017, surpassing the levels in 2015.

What about e-cigarette use? How has the usage of e-cigarettes changed over time?

```{r}
plot1a <- nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize(Ever=(mean(ecig_ever, na.rm = TRUE)*100),
                     Current=(mean(ecig_current, na.rm = TRUE)*100))%>%
  pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students")%>%
    ggplot(aes(x=year,y=`Percentage of students`)) +
    geom_line(aes(linetype=User)) + 
  geom_point(show.legend = FALSE, size = 2) +
  # this allows us to choose what type of line we want for each line
  scale_linetype_manual(values = c(2,1)) +
  # this allows us to specify how the y-axis should appear
  scale_y_continuous(breaks = seq(0,60,by=10),
                        labels = seq(0,60,by=10),
                        limits = c(0,60)) +
  #this adjusts the background style of the plot
    theme_linedraw() +
  # this moves the legend to the bottom of the plot and removes the x axis title
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How has e-cigarette use varied over the years?",
         y = "% of students")

plot1a 



```
It looks like the shape of the plot is very similar to tobacco usage overall. We see a downward trend until 2017 when the rate of both current and ever users increased. Recall that this is in agreement with the articles that we referenced earlier. We can see that the slope looks steeper for e-cigarette usage as compared to all tobacco products (including e-cigarettes).


Now let's plot this data together on the same plot.

We will have four groups (e-cigarette ever users, e-cigarette current users, tobacco ever users, and tobacco current users) to plot, therefore, it would be useful to add color to our plot. Keep in mind that e-cigarette users are a subset of any tobacco product users.

One important thing to keep in mind when creating plots is that individuals with color blindness may have a difficult time distinguishing groups when certain color choices are used. 

One great option is to use the `viridis` package, which offers color palettes with colors that are still distinguishable by individuals with most forms of color blindness. 

We can choose which colors we want to use by using the `show_col()` function of the `scales` package.

Here are some color options:
```{r}

scales:: show_col(viridis_pal()(6))
v_colors =  viridis(6)[c(1,4)]
```

We will select the first and fourth colors for our plot. To add these specific colors we will use the `scale_color_manual()` function of the `ggplot2` package.

We will calculate the mean ever and current usage percentages for students who used e-cigarettes or any tobacco products (including e-cigarettes) for each year again using the `group_by()` and `summarize()` functions. We will again use the `pivot_longer` function to convert our data to long format. We will also use the `separate()` function of the `tidyr` package to create two variables from one of the variables. This is done by separating by, in this case, an underscore. 


```{r}
 nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize("Ever_Any Tobacco Product \n (including e-cigarettes)"=(mean(tobacco_ever, na.rm = TRUE)*100),
                     "Current_Any Tobacco Product \n (including e-cigarettes)"=(mean(tobacco_current, na.rm = TRUE)*100),
                     "Ever_E-cigarettes"=(mean(ecig_ever, na.rm = TRUE)*100),
                     "Current_E-cigarettes"=(mean(ecig_current, na.rm = TRUE)*100))%>%
  pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students")%>%
  separate(User, into = c("User", "Product"), sep = "_")%>%
  head()


plot1t <- nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize("Ever_Any Tobacco Product \n (including e-cigarettes)"=(mean(tobacco_ever, na.rm = TRUE)*100),
                     "Current_Any Tobacco Product \n (including e-cigarettes)"=(mean(tobacco_current, na.rm = TRUE)*100),
                     "Ever_E-cigarettes"=(mean(ecig_ever, na.rm = TRUE)*100),
                     "Current_E-cigarettes"=(mean(ecig_current, na.rm = TRUE)*100))%>%
  pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students")%>%
  separate(User, into = c("User", "Product"), sep = "_")%>%
    ggplot(aes(x=year,y=`Percentage of students`, color = Product)) +
    geom_line(aes(linetype=User)) + 
  geom_point(show.legend = FALSE, size = 2) +
  # this allows us to choose what type of line we want for each line
  scale_linetype_manual(values = c(2,1)) +
  # we want purple associated with e-cigarettes to be consistent with later plots
  scale_color_manual(values = rev(v_colors))+
  # this allows us to specify how the y-axis should appear
  scale_y_continuous(breaks = seq(0,60,by=10),
                        labels = seq(0,60,by=10),
                        limits = c(0,60)) +
  #this adjusts the background style of the plot
    theme_linedraw() +
  # this moves the legend to the bottom of the plot and removes the x axis title
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How has tobacco use varied over the years?",
         y = "% of students")

plot1t 

```

We see an increase in all categories starting in 2017, but the rate of increase is higher for students using only e-cigarettes (current or ever users), as shown by the higher slope of the e-cigarette lines.

In the above plots, the "Any tobacco product" groups include individuals in the "E-cigarette only" groups. Now let's plot students in two mutually exclusive groups on the same plot: those who reported either using only e-cigarettes or only other tobacco products (besides e-cigarettes), but not both. 

We will calculate the mean ever and current usage percentages for students in these two mutually exclusive groups, again using the `group_by()` function and the `summarize()` function. We will again use the `pivot_longer` function to convert our data to long format. We will also again use the `separate()` function of the `tidyr` package to create two variables from one variable. This is done by separating by, in this case, a space. 

```{r}

nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize("Ever_E-cigarette"=(mean(ecig_only_ever, na.rm = TRUE)*100),
                     "Current_E-cigarette"=(mean(ecig_only_current, na.rm = TRUE)*100),
                     "Ever_Non-e-cigarette" =(mean(non_ecig_only_ever, na.rm = TRUE)*100),
                     "Current_Non-e-cigarette"=(mean(non_ecig_only_current, na.rm = TRUE)*100)) %>%
  pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students") %>%
  tidyr::separate(User, into = c("User", "Product"), sep = "_") %>%
  head()

plot1c <- nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize("Ever_E-cigarette"=(mean(ecig_only_ever, na.rm = TRUE)*100),
                     "Current_E-cigarette"=(mean(ecig_only_current, na.rm = TRUE)*100),
                     "Ever_Non-e-cigarette" =(mean(non_ecig_only_ever, na.rm = TRUE)*100),
                     "Current_Non-e-cigarette"=(mean(non_ecig_only_current, na.rm = TRUE)*100))%>%
  pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students")%>%
  separate(User, into = c("User", "Product"), sep = "_") %>%
    ggplot(aes(x=year,y=`Percentage of students`, color = Product)) +
    geom_line(aes(linetype=User)) + 
  geom_point(show.legend = FALSE, size = 2) +
  # this allows us to choose what type of line we want for each line
  scale_linetype_manual(values = c(2,1)) +
  # this allows us to specify how the y-axis should appear
  scale_y_continuous(breaks = seq(0,30,by=10),
                        labels = seq(0,30,by=10),
                        limits = c(0,30)) +
  scale_color_manual(values = v_colors)+
  #this adjusts the background style of the plot
    theme_linedraw() +
  # this moves the legend to the bottom of the plot and removes the x axis title
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How has use of only e-cigarettes and \n only tobacco products besides e-cigarettes varied over time?",
         y = "% of students")

plot1c

```

Very interesting! We can see from this plot that the percentage of students who had currently used (or ever tried) only e-cigarettes greatly increased starting in 2017, while in contrast the percentage of students who had ever tried only non-e-cigarette tobacco products actually diminished over time. In fact, we can see that in 2019 the percentage of students who were current e-cigarette users surpassed the percentage that had tried a non-e-cigarette product even just once. 


Recall that we made a variable called `Group` that identified students who used either just e-cigarette products, just other tobacco products (besides e-cigarettes), or students who used both e-cigarettes and some other type of tobacco product.

```{r}
nyts_data %>%
  count(Group)
```

We will now make a plot over time of each of these groups. Since we will have 4 total groups, we will use 4 of the viridis colors.
Notice, that in this case we are grouping by three variables by simply separating the variables that we want to group by with a comma in our `group_by()` function like this: `group_by(Group, year, n)`. 

```{r}

nyts_data %>%
  group_by(Group, year, n) %>%
  summarise(group_count = n()) %>%
  mutate("Percentage of students" = group_count /n*100) %>%
  head()

v_colors =  viridis(5)[1:4]

nyts_data %>%
  group_by(Group, year, n) %>%
  summarise(group_count = n()) %>%
  mutate("Percentage of students" = group_count /n*100) %>%
  ggplot(aes(x = year, y = `Percentage of students`, color = Group))+
  geom_point(size = 2)+
  geom_line() +
  scale_color_manual(values = v_colors)+
  theme_linedraw() +
  labs(x = "Year")

```

We can see that the majority of students did not report using any tobacco products. Of the students that did report using tobacco products, the majority of the students used both e-cigarettes and some other tobacco product. Again, a much larger percentage reported using only e-cigarettes rather than only other tobacco products in 2019.

We will further explore the relationship between e-cigarette usage and other tobacco products a bit later in the case study.


### Question 2

Now we want to look how e-cigarette smoking rates compare between males and females across time. 


We will calculate the percent ever and current e-cigarette users for each year and sex category again using the `group_by()` function and the `summarize()` function.  We will again use the `pivot_longer` function to convert our data to long format. Because there are some missing values for the `Sex` variable, we will filter these out; based on the codebook, there is little information about why these values are missing, so our survey data only allows us to consider two reported sex values with confidence.

AVOCADO: Carrie, I am filtering out the NAs for Sex now; can you check/modify the language I included above? I feel like you put it more eloquently on the phone. 

AVOCADO: Also, in this plot title e-cigarette is changed to vaping -- is that on purpose? Should we make that change earlier or throughout the case study? Or stick with e-cigarette?

```{r}

v_colors =  viridis(6)[c(1,4)]

nyts_data %>%
     filter(!is.na(Sex)) %>%
     group_by(year, Sex) %>%
     summarize(Ever=(mean(EELCIGT, na.rm = TRUE)*100),
               Current=(mean(CELCIGT, na.rm = TRUE)*100))%>%
     #filter(!is.na(Sex)) %>%
     pivot_longer(cols = Ever:Current, 
               names_to = "User", 
               values_to = "Percentage of students") %>%
     head()

plot2 <- nyts_data %>%
     filter(!is.na(Sex)) %>%
     group_by(year, Sex) %>%
     summarize(Ever=(mean(EELCIGT, na.rm = TRUE)*100),
               Current=(mean(CELCIGT, na.rm = TRUE)*100))%>%
    # filter(!is.na(Sex)) %>%
     pivot_longer(cols = Ever:Current, 
               names_to = "User", 
               values_to = "Percentage of students") %>%
    ggplot(aes(x = year,y =`Percentage of students`, color = Sex)) +
    geom_line(aes(linetype = User)) + 
    geom_point(show.legend = FALSE, size = 2) +
    scale_linetype_manual(values = c(2,1)) +
    scale_color_manual(values =v_colors)+
    theme_linedraw() +
    theme(legend.position = "bottom",
#          axis.text.x = element_text(angle = 90),
          axis.title.x = element_blank()) +
    labs(title = "How do vaping rates compare between males and females?",
         subtitle = "Current and ever users by sex",
         y = "% of students")

plot2
```

It looks like the rates are fairly similar between the sexes, however the rate for males appears to be consistently higher across time.


### Question 3

We are also interested in what vaping brands and flavors appear to be used the most frequently. Only the 2019 data set has this information. Therefore, we will filter for just this year using the `filter()` function of  the `dplyr` package. We will use the `summarize()` function slightly differently this time, to calculate the total number of students using each brand using the `n()` function and the `sum()` function to calculate the percent for each brand based on the counts. We will also reorder the factor levels for the brand names so that they are in descending order of percent use, using the `fct_reorder()` function from `dplyr`. This will make them appear in decreasing order of percent use on the plot.

We will make a bar plot this time by using `geom_bar`. Importantly we assign the `stat` argument to `identity`, so that we are using the percentages that we calculated not the counts which is what is used by default. When color in specified outside of the `aes()` argument, this determines the border color of the bars, which in this case will be black.

```{r}

 nyts_data %>%
    filter(year==2019) %>%
    group_by(brand_ecig) %>%
    filter(!is.na(brand_ecig)) %>%
  summarize(n = n()) %>%
    mutate(total = sum(n),
           Percent = n*100/total) %>%
    mutate(brand_ecig = fct_reorder(brand_ecig, desc(Percent)))


plot3 <- nyts_data %>%
    filter(year==2019) %>%
    group_by(brand_ecig) %>%
    filter(!is.na(brand_ecig)) %>%
  summarize(n = n()) %>%
    mutate(total = sum(n),
           Percent = n*100/total) %>%
    mutate(brand_ecig = fct_reorder(brand_ecig, desc(Percent))) %>%
    ggplot(aes(x=brand_ecig,y=Percent, fill=brand_ecig)) +
    geom_bar(stat="identity", color = "black") +
    theme_linedraw() +
    theme(legend.position = "none",
          axis.text.x = element_text(size=10),
          axis.title.x = element_blank()) +
    labs(title = "What vaping brands appear to be used the most frequently?",
         subtitle = "Brand of e-cigarette most frequently used in the last 30 days (2019)",
         y = "% of e-cigarette users responding")

plot3
```

Juul appears to be the most widely used brand. This is in agreement with a recent [article](https://tobaccocontrol.bmj.com/content/tobaccocontrol/28/2/146.full.pdf), whose most recent data was from 2017:

```{r, echo=FALSE, fig.cap="Huang J, Duan Z, Kwok J, et al. Tob Control 2019;28:146–151.", out.width = '100%'}
knitr::include_graphics(here::here("img", "HuangJ_DuanZ_KwokJ_et_al_TobaccoControl_Figure1.png"))
```

We are also interested in how the usage of different flavors has changed over time. 


To evaluate this we will calculate the percentage of students using each flavor each year - this includes usage of any type of flavored tobacco product. We will exclude 2015 data, as no specific flavor questions were asked at that time.

Recall that `NA` values are not included in calculating the total count for our percentages. However all of these flavor questions had complete reporting and did not have `NA` values. Therefore, these values reflect the percentage of students reporting using a particular favor out of all students surveyed (including those that did not use any tobacco products). Also students were allowed to select more than one flavor. You can see whether these variables had complete reporting by checking the `NA` values using the base `summary` function. Alternatively you can create a visual representation using the `vis_miss()` function of the `naniar` package.

#### {.scrollable}
```{r}
# Scroll through the output!
nyts_data %>% 
  filter(year!=2015) %>% 
  summary()

```
####


```{r}
nyts_data %>%
  filter(year!=2015) %>%
  select(menthol:alcoholic_drink)%>%
  vis_miss()

```

The plot above confirms that these variables have no `NA` values (because all fields indicate 100% of data is present).

```{r}
plot4 <- nyts_data %>%
  filter(year!=2015) %>%
  group_by(year) %>%
      summarize(Menthol = (mean(menthol)*100),
       `Clove or Spice` = (mean(clove_spice)*100),
                  Fruit = (mean(fruit)*100),
              Chocolate = (mean(chocolate)*100),
      `Alcoholic Drink` = (mean(alcoholic_drink)*100),
`Candy/Desserts/Sweets` = (mean(candy_dessert_sweets)*100),
                  Other = (mean(other)*100)) %>%
      pivot_longer(cols = - year, names_to = "Flavor", values_to = "Percentage of students") %>%
  rename(Year = year) %>%

 ggplot(aes(y = `Percentage of students`, 
            x = Year, 
         fill = reorder(Flavor, `Percentage of students`)))+
geom_bar(stat = "identity", position = "dodge", color = "black")+
  scale_fill_viridis(discrete=TRUE)+
  theme_linedraw() +
  guides(fill=guide_legend("Flavor"))+
  labs(title = "What flavors appear to be used the most frequently?",
       subtitle = "Flavors of tobacco products used in the past 30 days")

plot4 
```

From this plot, we can see that fruit flavors are the most widely used products, followed by menthol or mint flavored products. We can also see that there was a general increase in the usage of flavored products over time.

We will now look specifically at the usage of flavored e-cigarette products vs other flavored tobacco products. 

Recall that we made a variable called `Group` that identified students who used either just e-cigarette/vaping products, just other tobacco products (besides e-cigarettes), or students who used both e-cigarettes and some other type of tobacco product. We will compare the usage of these flavors for these different groups. We also perform some data summaries to decide how to order the panels (flavors) for display.



```{r}

v_colors =  viridis(5)[1:4]

plot5 <- nyts_data %>%
  filter(year != 2015) %>%
  group_by(year, Group) %>%
        summarize(Menthol = (mean(menthol)*100),
         `Clove or Spice` = (mean(clove_spice)*100),
                    Fruit = (mean(fruit)*100),
                Chocolate = (mean(chocolate)*100),
        `Alcoholic Drink` = (mean(alcoholic_drink)*100),
  `Candy/Desserts/\nSweets` = (mean(candy_dessert_sweets)*100),
                    Other = (mean(other)*100),
              Respondents = n()) %>%
  #converting all columns between and including Menthol and Other to one column called Flavor
  pivot_longer(cols = Menthol:Other, names_to = "Flavor", values_to = "Percentage of students") %>%
  group_by(Flavor) %>%
  # calculate the count of students in the year/group combination who used that flavor
  mutate(affirmative=(Respondents * `Percentage of students`)/100) %>%
  # calculate the fraction of total respondents who used that flavor
  mutate(flavor_mean = sum(affirmative)/sum(Respondents)) %>%
  ungroup() %>%
  # reorder the levels of Flavor to be in increasing order of percent of students who used that flavor
  mutate(flavor_mean_rank = dense_rank(flavor_mean),
         Flavor = fct_reorder(Flavor, flavor_mean_rank)) %>%
  ggplot(aes(x=year, y=`Percentage of students`, color=Group)) +
  facet_grid(~Flavor)+
  geom_line() + 
  geom_point(show.legend = FALSE, size = 2) + 
  scale_color_manual(values = v_colors) +
  theme_linedraw() +
  theme(legend.position = "bottom",
          axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90),
          strip.text.x = element_text(size = 10, face = "bold")) + 
  labs(title = "Among different product users, what flavors are most frequently used?")

plot5
```


We can see from this plot that there has been an increase in the number of students reporting using flavored tobacco products. Users who use both e-cigarettes and other tobacco products appear to report using flavored products the most, followed by users who only use e-cigarettes.

### Question 4

Is there a relationship between vaping rates and tobacco use? Now we will investigate the usage of e-cigarettes compared to other tobacco products in greater depth. 

First let's take a look at how e-cigarette usage and cigarette usage compare. We will select the data that specifically has to do with these products.

```{r}


v_colors =  viridis(6)[c(1,4)]

nyts_data %>%
    group_by(year) %>%
    summarize("Cigarettes, Ever" = (mean(ECIGT, na.rm = TRUE)*100),
            "E-cigarettes, Ever" = (mean(EELCIGT, na.rm = TRUE)*100),
           "Cigarettes, Current" = (mean(CCIGT, na.rm = TRUE)*100),
         "E-cigarettes, Current" = (mean(CELCIGT, na.rm = TRUE)*100)) %>%
    pivot_longer(cols= -year, names_to = "Category", values_to = "Percentage of students") %>%
    separate( Category, into= c("Product", "User"), sep = ", ") %>%
    head()


plot6 <- nyts_data %>%
    group_by(year) %>%
    summarize("Cigarettes, Ever" = (mean(ECIGT, na.rm = TRUE)*100),
            "E-cigarettes, Ever" = (mean(EELCIGT, na.rm = TRUE)*100),
           "Cigarettes, Current" = (mean(CCIGT, na.rm = TRUE)*100),
         "E-cigarettes, Current" = (mean(CELCIGT, na.rm = TRUE)*100)) %>%
    pivot_longer(cols= -year, names_to = "Category", values_to = "Percentage of students")%>%
    separate( Category, into= c("Product", "User"), sep = ", ") %>%
    ggplot(aes(x=year,y=`Percentage of students`, color = Product, linetype = User)) +
    geom_line() + 
  geom_point(show.legend = FALSE, size = 2) +
  scale_linetype_manual(values = c(2,1)) +
  scale_color_manual(values = v_colors) +
    theme_linedraw() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does e-cigarette use compare to cigarette use?",
         subtitle = "Current and ever users of e-cigarettes and cigarettes",
         y = "% of students")

plot6
```

Interesting! we can see that in 2019 the percentage of students that reported currently using e-cigarettes had surpassed those that ever tried (even just once) a cigarette. Overall cigarette usage appears to be declining over time. This is not the case for e-cigarettes.


Now we will look at students who reported that they had ever tried e-cigarettes or non-cigarette products. In this case we will not separate out users who specifically only used one or the other. Therefore, the students included in this plot who reported as having ever tried e-cigarettes might also be  current users of non-e-cigarette products or may have at least tried non-e-cigarette products.


```{r}

v_colors =  viridis(6)[c(1,4)]

plot7 <- nyts_data %>%
    group_by(year) %>%
      summarize(`e-cigarette_ever`= (mean(ecig_ever, na.rm = TRUE)*100),
            `non-e-cigarette_ever`= (mean(non_ecig_ever, na.rm = TRUE)*100)) %>%
    pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    separate(Category, into = c("Product", "User"), sep = "_") %>%
    ggplot(aes(x=year,y=`Percentage of students`, color=Product)) +
    geom_line() +
  geom_point(show.legend = FALSE, size = 2) +
  scale_color_manual(values = v_colors) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_linedraw() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does the rate of ever trying e-cigarettes \ncompare to ever trying other products over time?",
         y = "% of students")

plot7
```


Now we will do the same, but for students who reported currently using e-cigarettes or non-e-cigarette products. 


```{r}
v_colors =  viridis(6)[c(1,4)]

plot8 <- nyts_data %>%
    group_by(year) %>%
      summarize(`e-cigarette_current`= (mean(ecig_current, na.rm = TRUE)*100),
            `non-e-cigarette_current`= (mean(non_ecig_current, na.rm = TRUE)*100)) %>%
    pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students") %>%
    separate(Category, into = c("Product", "User"), sep = "_") %>%
    ggplot(aes(x=year, y=`Percentage of students`, color=Product)) +
    geom_line(linetype = "dashed") +
  geom_point(show.legend = FALSE, size = 2) +
  scale_color_manual(values = v_colors) +
  scale_linetype_manual(values = c(1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
    theme_linedraw() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does the rate of currently using e-cigarettes \ncompare to currently using other products over time?",
         y = "% of students")

plot8
```

### Putting plots together

Now we will put these plots together using the `plot_grid()` function of the `cowplot` package.  We will also modify the labels using the `ggdraw()` function, which is also part of the `cowplot` package.

```{r, fig.height=10}
plotA_uw <- plot1 +
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
    labs(title = "Tobacco product users more prevalent after 2017",
         subtitle = NULL,
         y = "% of students")

plotB_uw <- plot7 +
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
    labs(title = "% Ever trying e-cigarettes increases &\n% Ever trying other products decreases",
         subtitle = NULL,
         y = "% of students")

plotC_uw <- plot8 + 
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
    labs(title = "% Currently using e-cigarettes increases &\n% Currently using other products decreases",
         subtitle = NULL,
         y = "% of students")

title_uw <- ggdraw() + 
  draw_label(
    "Have e-cigarette rates possibly influenced tobacco use?",
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_uw <- plot_grid(plotA_uw,
                     rel_widths = c(1,1))
plotsBC_uw <- plot_grid(plotB_uw,
                        plotC_uw,
                        rel_widths = c(1,1))

# this will take the legend from plot1c to use as the legend for the plot we are creating
legend_uw <- get_legend(plot1c +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_uw <- plot_grid(title_uw,
                       plotsA_uw,
                       plotsBC_uw,
                       legend_uw,
                       ncol = 1,
                       rel_heights = c(0.1,
                                       1,
                                       1,
                                       0.1),
                       scale = 1.0)

figure_uw
```


### Survey Weighting

It turns out that our analysis thus far has been brushing an important statistical concept under the rug, related to how our data were collected. Our data come from responses to a survey, which may have a particular sampling scheme to capture data about the population we are interested in. For example, the survey may be designed to capture a set of individuals who reflect the characteristics of the population that we are interested in drawing conclusions about. However, only a fraction of the individuals who were contacted about taking the survey may have completed it, and this fraction of individuals may no longer be representative of the population. Or the survey may be designed to over-sample a particular group of interest so that individuals from that group show up more often as survey respondents than are present in the population overall. In order to account for the fact that the survey respondents may not reflect the composition of the population we want to generalize to, we can emply a technique called [survey weighting](http://www.applied-survey-methods.com/weight.html){target="_blank"}.

Survey weighting is a common technique used in survey data analysis because often the individuals that take a survey are not necessarily representative of the population that we are trying to gather information about. For example, we may have more females that respond to the survey than males because perhaps female students were more willing to participate. In this case, the proportion of data values in our data will be smaller for the males than the proportion of actual male students and larger for the females than the true proportion of actual female students. To get a better estimate of overall e-cigarette smoking rates, the data from the males can be weighted based on the true proportion of male students to amplify the contribution of the responses from the males that did participate. Conversely, the female data can be weighted to diminish the contribution if their responses to the overall picture. We will see if using survey weighting changes the general trends that we see in our data. 

Calculating survey weights involves making a weight based on the ratio of the proportion of survey respondents from a particular group and the actual proportion of that group in the population. For example, let's say that females account for 50% of the population and males account for 50 % of the population. Let's also say that 75% of the respondents to the survey were female and only 25% were males. 

Then we could calculate survey weights using this formula:

$$ \frac{\text{actual proportion of group in the population}}{\text{ proportion of group in the respondents}}$$  

Thus the weight for the females would be calculated as:

$$ \frac{.5}{.75} = .67$$  

The weight for the males would be calculated as:

$$ \frac{.5}{.25} = 2$$

Therefore each male response value would be multiplied by a factor of 2 and would have twice the contribution, while the female response values would have only about 70% of the contribution that they would have had without weighting.

Note that survey weights are in reality corrected for other aspects - for example the response rate to individual questions.

We do not need to calculate survey weights for our data as they were already supplied in the data set, as described in the codebooks. 

#### `srvyr` package and survey design 

***

We will now use the `srvyr` package to evaluate our data using survey weights that were provided in the data for each year, as described in the respective codebooks. This package contains functions that allow the user to easily perform calculations from the data that take the survey design into account, without having to work out the math by hand.

Within the data you will see that we have three variables related to the survey sampling scheme:  `psu`, `finwgt`, and `stratum`. Details about these variables are available, for example, in the [2019 Methodology Report](./docs/2019-nyts-dataset-and-codebook-microsoft-excel/2019-nyts-methodology-report-p.pdf){target="_blank"}.

In brief they represent: 

1) `psu`: Surveys like the one used to create the data we are using often sample people based on strata. This is done to ensure that the responses are representative of the population of interest. Thus, often people first think about ensuring that surveys are conducted in a variety of geographical areas. This is often called the **primary sampling unit** or **PSU**. In [this survey](https://web.sph.harvard.edu/mch-data-connect/results/national-youth-tobacco-survey-nyts/){target="_blank"}, the county where the student's school was located was used as the PSU. 

2) `stratum`: A categorical variable that indicates subsets of the data that include respondents from different *PSUs*. In our case, strata are determined by the predominant minority in the PSU (Non-Hispanic Black or Hispanic), whether the PSU is urban or non-urban, and what percent of the students in the PSU fall into the predominant minority group. PSUs are allocated across the 16 possible strata according to the sampling scheme. These strata values allow estimates based on the survey responses to be calculated using different strata allowing for improved precision of the response estimates.

AVOCADO: I added a bunch of details here to the definition of stratum - perhaps give it a quick read. I also think it might be worth including one of the Methodology Reports, which is where I pulled this information from.

3) `finwgt`: The survey weight which was calculated based on a variety of factors.

This [link](https://web.sph.harvard.edu/mch-data-connect/results/national-youth-tobacco-survey-nyts/){target="_blank"} and this [link](https://osf.io/n7r32){target="_blank"} have more information about the study design of the data that we are using.

For detailed information on such survey designs in general see [here](http://www.asasrms.org/Proceedings/y2008/Files/301835.pdf){target="_blank"} and [here](http://ocw.jhsph.edu/courses/StatMethodsForSampleSurveys/PDFs/Lecture5.pdf){target="_blank"}.

We will use the `as_survey_design()` function of the `srvyr`package to create a survey object with a specified survey design. This is a special R object that includes information about how the survey was conducted that can be taken into account in the analysis.

There are several arguments to pay attention to:

1) The `strata` argument is used to specify the variable(s) that defined strata in the data. In this case, we will use the `stratum` variable.
2) The `ids` argument is used to define cluster ids within the data. In this case we will use the `psu` variable.
3) The `weight` argument is the  used to define which variable(s) are the survey weights.
4) The `nest = TRUE` argument, forces cluster ids (in this case the PSU) to be nested within the strata.

We can then use the `survey_mean()` function to calculate percentages of students who report using tobacco for each year while accounting for the survey design and weights. We will specify that we want [confidence interval](https://en.wikipedia.org/wiki/Confidence_interval) estimates by using the `vartype = "ci"` argument. The confidence intervals in our case give a range of possible values for the true population mean based on the data observed in the survey. We will multiply these values by 100 to get percentages. (Note: We could also have calculated confidence intervals for the unweighted results above by computing them by hand; we leave this as a potential exercise.)

Since the survey weights are specific to a single year of the survey results, we need to create survey design objects for each year separately. We will use `group_by` and `group_modify`, which is also from the `dplyr` package, to do this. We first write the function that we want to call on each group.

This function will take an input called `currYear`, which will be one set of survey responses for a specific year, and then craetes a survey design based on the `stratum` and `finwgt` specific to that year. It will then calculate the mean for student respondants who have ever tried any tobacco products or are a current user of any tobacco products accounting for the survey design and weights using `survey_mean()` as was just described. The function will then wrangle the data to produce new variables about the type of mean estimate that was given from the `survey_mean` output and the type of user. This 

### Weighted Sample
```{r}
surveyMeanA <- function(currYear){
  options(survey.lonely.psu = "average")
  currYear %>%
  as_survey_design(strata = stratum, 
                      ids = psu, 
                  weight  = finwgt, 
                     nest = TRUE) %>%
   summarize(tobacco_ever = survey_mean(tobacco_ever, 
                                        vartype = "ci", 
                                        na.rm = TRUE),
          tobacco_current = survey_mean(tobacco_current, 
                                        vartype = "ci", 
                                        na.rm = TRUE)) }


nyts_data %>% 
  group_by(year) %>%
  group_modify(~surveyMeanA(.x)) %>%
  head()



```

Now let's make the function wrangle the output in a more usable form too:
```{r}

surveyMeanA <- function(currYear){
  options(survey.lonely.psu = "average")
  currYear %>%
  as_survey_design(strata = stratum, 
                      ids = psu, 
                  weight  = finwgt, 
                     nest = TRUE) %>%
   summarize(tobacco_ever = survey_mean(tobacco_ever, 
                                        vartype = "ci", 
                                        na.rm = TRUE),
          tobacco_current = survey_mean(tobacco_current, 
                                        vartype = "ci", 
                                        na.rm = TRUE))  %>%
  mutate_all( "*", 100) %>%
    pivot_longer(everything(), 
                 names_to = "Type", 
                 values_to = "Percentage of students") %>%
  mutate(Estimate = case_when(str_detect(Type,"_low") ~ "Lower",
                              grepl("_upp", Type) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Type) ~ "Ever",
                          grepl("current", Type) ~ "Current",
                          TRUE ~ "Mean"))
  
}

nyts_data %>% 
  group_by(year) %>%
  group_modify(~surveyMeanA(.x))

```

```{r}
### AVOCADO: Remove rest of chunk from here if you are OK with this way of doing it
nyts_data %>%
  as_survey_design(strata = stratum, 
                      ids = psu, 
                  weight  = finwgt, 
                     nest = TRUE) %>%
          group_by(year) %>%
   summarize(tobacco_ever = survey_mean(tobacco_ever, 
                                        vartype = "ci", 
                                        na.rm = TRUE),
          tobacco_current = survey_mean(tobacco_current, 
                                        vartype = "ci", 
                                        na.rm = TRUE))  %>%
  mutate_at(vars(-year), "*", 100) %>%
    pivot_longer(cols = -year, 
                 names_to = "Type", 
                 values_to = "Percentage of students") %>%
  mutate(Estimate = case_when(grepl("_low", Type) ~ "Lower",
                              grepl("_upp", Type) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Type) ~ "Ever",
                          grepl("current", Type) ~ "Current",
                          TRUE ~ "Mean")) %>%
  head()


```

We will now make a plot using this data. The confidence intervals are included using the `geom_linerange()` function of the `ggplot2` package.

```{r}
### AVOCADO: Remove this first version of the figure if you are OK with my modifications below
plotA_w <- nyts_data %>%
  as_survey_design(strata = stratum, 
                      ids = psu, 
                  weight  = finwgt, 
                     nest = TRUE) %>%
          group_by(year) %>%
   summarize(tobacco_ever = survey_mean(tobacco_ever, 
                                        vartype = "ci", 
                                        na.rm = TRUE),
          tobacco_current = survey_mean(tobacco_current, 
                                        vartype = "ci", 
                                        na.rm = TRUE))  %>%
  mutate_at(vars(-year), "*", 100) %>%
    pivot_longer(cols = -year, 
                 names_to = "Type", 
                 values_to = "Percentage of students") %>%
  mutate(Estimate = case_when(grepl("_low", Type) ~ "Lower",
                              grepl("_upp", Type) ~ "Upper",
                                             TRUE ~ "Mean"),
         User = case_when(grepl("ever", Type) ~ "Ever",
                       grepl("current", Type) ~ "Current",
                                         TRUE ~ "Mean")) %>%
  dplyr::select(-Type) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  ggplot(aes(x=year,y=Mean)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1, show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_color_manual(values = v_colors) +
  scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
    theme_linedraw() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "Tobacco product users more prevalent after 2017",
         y = "% of students")
plotA_w

## AVOCADO: updated version using group_modify
plotA_w <- nyts_data %>% 
  group_by(year) %>%
  group_modify(~surveyMeanA(.x)) %>%
  dplyr::select(-Type) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  ggplot(aes(x=year,y=Mean)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1, show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_color_manual(values = v_colors) +
  scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
    theme_linedraw() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "Tobacco product users more prevalent after 2017",
         y = "% of students")
plotA_w

```

Now we can see that we have confidence interval ranges plotted for each value. 

We will make a similar plot for students who reported ever trying or who currently use e-cigarettes as opposed to tobacco in general.

```{r}
surveyMeanB <- function(currYear){
  options(survey.lonely.psu = "average")
  currYear %>%
  as_survey_design(strata = stratum, 
                      ids = psu, 
                  weight  = finwgt, 
                     nest = TRUE) %>%
  summarize(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE)) %>%  
  mutate_all("*", 100) %>%
  pivot_longer(everything(), 
           names_to = "Category", 
          values_to = "Percentage of students") %>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`)

}

nyts_data %>% 
  group_by(year) %>%
  group_modify(~surveyMeanB(.x)) %>%
  head()

## AVOCADO: Drop this version of code
nyts_data %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
      summarize(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE)) %>%  mutate_at(vars(-year), "*", 100) %>%
  pivot_longer(cols = -year, 
           names_to = "Category", 
          values_to = "Percentage of students") %>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  head()


## AVOCADO: Drop this version of plot
plotB_w <- nyts_data %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
summarize(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE)) %>%  mutate_at(vars(-year), "*", 100) %>%
  pivot_longer(cols = -year, 
           names_to = "Category", 
          values_to = "Percentage of students") %>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
  ggplot(aes(x=year,y=Mean, color = Product)) +
  geom_line() +
  geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1, show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_color_manual(values = v_colors) +
  scale_y_continuous(breaks = seq(0,60,by=10),
                       labels = seq(0,60,by=10),
                       limits = c(0,60)) +
    theme_linedraw() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% Ever trying e-cigarettes increases &\n% Ever trying other products decreases",
         y = "% of students")

plotB_w

plotB_w <- nyts_data %>% 
  group_by(year) %>%
  group_modify(~surveyMeanB(.x)) %>%
  ggplot(aes(x=year,y=Mean, color = Product)) +
  geom_line() +
  geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1, show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_color_manual(values = v_colors) +
  scale_y_continuous(breaks = seq(0,60,by=10),
                       labels = seq(0,60,by=10),
                       limits = c(0,60)) +
    theme_linedraw() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% Ever trying e-cigarettes increases &\n% Ever trying other products decreases",
         y = "% of students")

plotB_w

```


Now we will do the same but for current users:

```{r}
surveyMeanC <- function(currYear){
  options(survey.lonely.psu = "average")
  currYear %>%
  as_survey_design(strata = stratum, 
                      ids = psu, 
                  weight  = finwgt, 
                     nest = TRUE) %>%
       summarize(ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
  mutate_all("*", 100) %>%
  pivot_longer(everything(), 
           names_to = "Category", 
          values_to = "Percentage of students") %>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`)

}

#AVOCADO: Drop this version of plot code
plotC_w <- nyts_data %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt, nest=TRUE) %>%
    group_by(year) %>%
       summarize(ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE)) %>%
       mutate_at(vars(-year), "*", 100) %>%
       pivot_longer(cols = -year,
                    names_to = "Category", 
                    values_to = "Percentage of students") %>%
       mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(aes(linetype="dashed")) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1,  show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
  scale_color_manual(values = v_colors) +
    theme_linedraw() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% Currently using e-cigarettes increases &\n% Currently using other products decreases",
         y = "% of students")
plotC_w

plotC_w <- nyts_data %>% 
  group_by(year) %>%
  group_modify(~surveyMeanC(.x)) %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(aes(linetype="dashed")) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1,  show.legend = FALSE) +
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
  scale_color_manual(values = v_colors) +
    theme_linedraw() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% Currently using e-cigarettes increases &\n% Currently using other products decreases",
         y = "% of students")
plotC_w


```


Now we will put these plots together again using the `cowplot` package:

```{r}
title_w <- ggdraw() + 
  draw_label(
    expression("What is the relationship between vaping rates and tobacco use?"),
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_w <- plot_grid(plotA_w,
                     rel_widths = c(1),
                     align = "v",
                     axis = "bt")
plotsBC_w <- plot_grid(plotB_w,
                     plotC_w,
                     rel_widths = c(1,1),
                     align = "v",
                     axis = "bt")

legend_w <- get_legend(plot1c +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_w <- plot_grid(title_w,
                      plotsA_w,
                      plotsBC_w,
                      legend_w,
                      ncol = 1,
                      rel_heights = c(0.1,
                                      1,
                                      1,
                                      0.1),
                      scale = 1.0)

figure_w
```

We can see that these figures look quite similar to the ones generated without using the survey weights.

### Artificial Cohort

Although the survey design does not allow specific individuals to be followed over time, we will use certain subsets of the data from each year to construct an artificial cohort where we follow students of the same age group as they get older. This will allow us to look at how tobacco usage changed for students who were in 8th grade in 2015 as they aged. 

All of the data so far has included all 6th-12th graders every year. Now we will look at just the data for students expected to graduate in 2019. These are the students who were in 8th grade in 2015, most of whom were 9th graders in 2016, 10th graders in 2017 and so on. We will filter the data to just the students expected to be in the graduating class of 2019.


```{r}

surveyMeanCohort <- function(currYear){
  options(survey.lonely.psu = "average")
  currYear %>%
  as_survey_design(strata = stratum, 
                      ids = psu, 
                  weight  = finwgt, 
                     nest = TRUE) %>%
summarize(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE),
          tobacco_ever_year = survey_mean(tobacco_ever, vartype = "ci", na.rm=TRUE),
            tobacco_current_year = survey_mean(tobacco_current, vartype = "ci", na.rm=TRUE))%>%
  mutate_all("*", 100) %>%
  pivot_longer(everything(), 
           names_to = "Category", 
          values_to = "Percentage of students") %>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                             grepl("tobacco", Category) ~ "Any tobacco product",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`)

}

Cohort_data <-nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
  as_survey_design(strata = stratum, ids = psu, weight  = finwgt) %>%
    group_by(year) %>%
summarize(ecig_ever_year = survey_mean(ecig_ever, vartype = "ci", na.rm=TRUE),
            ecig_current_year = survey_mean(ecig_current, vartype = "ci", na.rm=TRUE),
            non_ecig_ever_year = survey_mean(non_ecig_ever, vartype = "ci", na.rm=TRUE),
            non_ecig_current_year = survey_mean(non_ecig_current, vartype = "ci", na.rm=TRUE),
          tobacco_ever_year = survey_mean(tobacco_ever, vartype = "ci", na.rm=TRUE),
            tobacco_current_year = survey_mean(tobacco_current, vartype = "ci", na.rm=TRUE))%>%
  mutate_at(vars(-year), "*", 100) %>%
  pivot_longer(cols = -year, names_to = "Category", values_to = "Percentage of students")%>%
    mutate(Estimate = case_when(grepl("_low", Category) ~ "Lower",
                          grepl("_upp", Category) ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(grepl("ever", Category) ~ "Ever",
                          grepl("current", Category) ~ "Current"),
         Product = case_when(grepl("non_ecig", Category) ~ "Other products",
                             grepl("tobacco", Category) ~ "Any tobacco product",
                          TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate, 
              values_from = `Percentage of students`)

head(Cohort_data)


Cohort_data <-nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
          (Grade == "12" & year == 2019) 
         ) %>%
  group_by(year) %>%
 group_modify(~surveyMeanCohort(.x))

head(Cohort_data)

```

We will now make similar plots to those above for this subset of the data:

```{r}
plotA_w_8 <-Cohort_data %>%
  filter(Product == "Any tobacco product") %>%
  ggplot(aes(x=year,y=Mean)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1) + 
  scale_linetype_manual(values = c(2,1)) +
    scale_y_continuous(breaks = seq(0,70,by=10),
                       labels = seq(0,70,by=10),
                       limits = c(0,70)) +
  scale_color_manual(values = v_colors) +
    theme_linedraw() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "Tobacco product use became increasingly prevalent",
         y = "% of students")

plotB_w_8 <-Cohort_data %>%
  filter(Product != "Any tobacco product", User == "Ever") %>%
  ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(linetype=1) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1) + 
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10,60)) +
  scale_color_manual(values = v_colors) +
    theme_linedraw() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "% ever trying tobacco products increases",
         y = "% of students")

plotC_w_8 <-Cohort_data %>%
  filter(Product != "Any tobacco product", User == "Current") %>%
    ggplot(aes(x=year,y=Mean, color=Product)) +
  geom_line(aes(linetype=User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1) + 
  scale_linetype_manual(values = c(2,1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0,60)) +
  scale_color_manual(values = v_colors) +
    theme_linedraw() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "E-cigarette use surpasses use of other products",
         y = "% of students")

title_w_8 <- ggdraw() + 
  draw_label(
    expression("Among students expected to be in the graduating class of 2019, how are vaping rates related to tobacco use?"),
    fontface = 'bold',
    size=14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_w_8 <- plot_grid(plotA_w_8,
                        rel_widths = c(1),
                        align = "v",
                        axis = "bt")

plotsBC_w_8 <- plot_grid(plotB_w_8,
                         plotC_w_8,
                         rel_widths = c(1,1),
                         axis = "bt")

legend_w_8 <- get_legend(plot1c +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_w_8 <- plot_grid(title_w_8,
                        plotsA_w_8,
                        plotsBC_w_8,
                        legend_w_8,
                        ncol = 1,
                        rel_heights = c(0.1,
                                      1,
                                      1,
                                      0.1),
                        scale = 1.0
)

figure_w_8
```

### Final Figure

Finally we will put our plots together to create a plot that describes the relationship between e-cigarette usage and overall tobacco use, combining both our first set of unweighted results, and those calculated using the `srvyr` package.

```{r}
title_final <- ggdraw() +
  draw_label(
    expression("What is the relationship between vaping rates and tobacco use?"),
    fontface = 'bold',
    size=16,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_uw_final <- ggdraw() + 
  draw_label(
    expression(~6^th~"-"~12^th~"graders, unweighted"),
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_w_final <- ggdraw() + 
  draw_label(
    expression(~6^th~"-"~12^th~"graders, weighted"),
    fontface = 'bold',
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_w_8_final <- ggdraw() + 
  draw_label(
    expression("Approximate graduating \n class of 2019, weighted"),
    fontface = 'bold',
    size=12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_final <- plot_grid(subtitle_uw_final,
                            subtitle_w_final,
                            subtitle_w_8_final,
                            ncol = 3)

plotsA_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of any tobacco product use by user type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_final <- plot_grid(plotA_uw + theme(plot.title = element_blank()),
                          plotA_w + theme(plot.title = element_blank()),
                          plotA_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

plotsB_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of ever trying by product type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsB_final <- plot_grid(plotB_uw + theme(plot.title = element_blank()),
                          plotB_w + theme(plot.title = element_blank()),
                          plotB_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

plotsC_title_final <- ggdraw() + 
  draw_label(
    expression("Prevalence of current use by product type"),
    size=14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsC_final <- plot_grid(plotC_uw + theme(plot.title = element_blank()),
                          plotC_w + theme(plot.title = element_blank()),
                          plotC_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

legend_final <- get_legend(plot1c +
                             theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

final_plot <- plot_grid(title_final,
          plotsA_title_final,
          subtitle_final,
          plotsA_final,
          plotsB_title_final,
          subtitle_final,
          plotsB_final,
          plotsC_title_final,
          subtitle_final,
          plotsC_final,
          legend_final,
          ncol = 1,
          rel_heights = c(0.2,
                          0.2,
                          0.1,
                          1,
                          0.2,
                          0.1,
                          1,
                          0.2,
                          0.1,
                          1,
                          0.1))

final_plot
```

```{r, echo=FALSE, include=FALSE}
ggsave(here::here("img", "final_plot.png"))
```

## **Suggested Homework**
*** 

<style>
div.blue { background-color:#e6f0ff; border-radius: 5px; padding: 20px;}
</style>
<div class = "blue">

+ Calculate confidence intervals for the unweighted estimates and add the appropriate error bars to the main figures.
+ Apply survey weights to one of the figures produced in this case study in which weighted estimates were not produced. Include error bars in the updated figure.
    + Does the figure change after the application of survey weights?
    + If so, describe how. 
+ Reproduce `final_plot` above for a different cohort of your choice.

</div>


## **Data Analysis**
*** 

## **Summary**
*** 
 cool visualization that summarizes everything... could add this somewhere: https://www.fda.gov/tobacco-products/youth-and-tobacco/youth-tobacco-use-results-national-youth-tobacco-survey


## **Helpful Links**
*** 


## **Session info**
***

```{r}
sessionInfo()
```